ANNE ANASTASI

Professor of Psychology, Fordham University

Psychological Testing

MACMILLAN PUBLISHING CO., INC.
New York

Collier Macmillan Publishers
London

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the Publisher. Earlier editions copyright 1954 and © 1961 by Macmillan Publishing Co., Inc., and copyright © 1968 by Anne Anastasi.
MACMILLAN PUBLISHING Co., INC.

866 Third Avenue, New York, New York 10022
COLLIER MACMILLAN CANADA, LTD.

Library of Congress Cataloging in Publication Data

Anastasi, Anne, (date)
Psychological testing.
Bibliography: p.
Includes indexes.
1. Mental tests. 2. Personality tests. I. Title.
[DNLM: 1. Psychological tests. WM145 A534P]
BF431.A573 1976    153.9    75-2206
ISBN 0-02-302980-3

Preface

IN A revised edition, one expects both similarities and differences. This edition shares with the earlier versions the objectives and basic approach of the book. The primary goal of this text is still to contribute toward the proper evaluation of psychological tests and the correct interpretation and use of test results. This goal calls for several kinds of information: (1) an understanding of the major principles of test construction, (2) psychological knowledge about the behavior being assessed, (3) sensitivity to the social and ethical implications of test use, and (4) broad familiarity with the types of available instruments and the sources of information about tests. A minor innovation in the fourth edition is the addition of a suggested outline for test evaluation (Appendix C).

In successive editions, it has been necessary to exercise more and more restraint to keep the number of specific tests discussed in the book from growing with the field; it has never been my intention to provide a miniature Mental Measurements Yearbook! Nevertheless, I am aware that principles of test construction and interpretation can be better understood when applied to particular tests. Moreover, acquaintance with the major types of available tests, together with an understanding of their special contributions and limitations, is an essential component of knowledge about contemporary testing. For these reasons, specific tests are again examined and evaluated in Parts 3, 4, and 5. These tests have been chosen either because they are outstanding examples with which the student of testing should be familiar or because they illustrate some special point of test construction or interpretation. In the text itself, the principal focus is on types of tests rather than on specific instruments. At the same time, Appendix E contains a classified list of over 250 tests, including not only those cited in the text but also others added to provide a more representative sample.

As for the differences, they loomed especially large during the preparation of this edition. Much that has happened in human society since the mid-1960's has had an impact on psychological testing. Some of these developments were briefly described in the last two chapters of the third edition. Today they have become part of the mainstream of psychological testing and have been accordingly incorporated in the appropriate sections throughout the book. Recent changes in psychological testing that are reflected in the present edition can be described on three levels: (1) general orientation toward testing, (2) substantive and methodological developments, and (3) "ordinary progress," such as the publication of new tests and revision of earlier tests.


An example of changes on the first level is the increasing awareness of the ethical, social, and legal implications of testing. In the present edition, this topic has been expanded and treated in a separate chapter early in the book (Ch. 3) and in Appendixes A and B. A cluster of related developments represents a broadening of test uses. Besides the traditional applications of tests in selection and diagnosis, increasing attention is being given to administering tests for self-knowledge and self-development, and to training individuals in the use of their own test results in decision making (Chs. 3 and 4). In the same category are the continuing replacement of global scores with multitrait profiles and the application of classification strategies, whereby "everyone can be above average" in one or more socially valued variables (Ch. 7). From another angle, efforts are being made to modify traditional interpretations of test scores, in both cognitive and noncognitive areas, in the light of accumulating psychological knowledge. In this edition, Chapter 12 brings together psychological issues in the interpretation of intelligence test scores, touching on such problems as stability and change in intellectual level over time; the nature of intelligence; and the testing of intelligence in early childhood, in old age, and in different cultures. Another example is provided by the increasing emphasis on situational specificity and person-by-situation interactions in personality testing, stimulated in large part by the social-learning theorists (Ch. 17).

The second level, covering substantive and methodological changes, is illustrated by the impact of computers on the development, administration, scoring, and interpretation of tests (see especially Chs. 4, 11, 13, 17, 18, 19). The use of computers in administering or managing instructional programs has also stimulated the development of criterion-referenced tests, although other conditions have contributed to the upsurge of interest in such tests in education. Criterion-referenced tests are discussed principally in Chapters 4, 5, and 14. Other types of instruments that have risen to prominence and have received fuller treatment in the present edition include: tests for identifying specific learning disabilities (Ch. 16), inventories and other devices for use in behavior modification programs (Ch. 20), instruments for assessing early childhood education (Ch. 14), Piagetian "ordinal" scales (Chs. 10 and 14), basic education and literacy tests for adults (Chs. 13 and 14), and techniques for the assessment of environments (Ch. 20). Problems to be considered in the testing of minority groups, including the question of test bias, are examined from different angles in Chapters 3, 7, 8, and 12.

On the third level, it may be noted that over 100 of the tests listed in this edition have been either initially published or revised since the publication of the preceding edition (1968).

Major examples include the McCarthy Scales of Children's Abilities, the WISC-R, the 1972 Stanford-Binet norms (with all the resulting readjustments in interpretations), Forms S and T of the DAT (including a computerized Career Planning Program), the Strong-Campbell Interest Inventory (merged form of the SVIB), and the latest revisions of the Stanford Achievement Test and the Metropolitan Readiness Tests.

It is a pleasure to acknowledge the assistance received from many sources in the preparation of this edition. The completion of the project was facilitated by a one-semester Faculty Fellowship awarded by Fordham University and by a grant from the Fordham University Research Council covering principally the services of a research assistant. These services were performed by Stanley Friedland with an unusual combination of expertise, responsibility, and graciousness. I am indebted to the many authors and test publishers who provided reprints, unpublished manuscripts, specimen sets of tests, and answers to my innumerable inquiries by mail and telephone. For assistance extending far beyond the interests and responsibilities of any single publisher, I am especially grateful to Anna Dragositz of Educational Testing Service and Blythe Mitchell of Harcourt Brace Jovanovich, Inc. I want to acknowledge the significant contribution of John T. Cowles of the University of Pittsburgh, who assumed complete responsibility for the preparation of the Instructor's Manual to accompany this text.

For informative discussions and critical comments on particular topics, I want to convey my sincere thanks to William H. Angoff of Educational Testing Service and to several members of the Fordham University Psychology Department, including David R. Chabot, Marvin Reznikoff, Reuben M. Schonebaum, and Warren W. Tryon. Grateful acknowledgment is also made of the thoughtful recommendations submitted by course instructors in response to the questionnaire distributed to current users of the third edition. Special thanks in this connection are due to Mary Carol Cahill for her extensive, constructive, and wide-ranging suggestions. I wish to express my appreciation to Victoria Overton of the Fordham University library staff for her efficient and courteous assistance in bibliographic matters. Finally, I am happy to record the contributions of my husband, John Porter Foley, Jr., who again participated in the solution of countless problems at all stages in the preparation of the book.

A.A.

CONTENTS

PART 1  CONTEXT OF PSYCHOLOGICAL TESTING

1. FUNCTIONS AND ORIGINS OF PSYCHOLOGICAL TESTING
Current uses of psychological tests 3
Early interest in classification and training of the mentally retarded 5
The first experimental psychologists 7
Contributions of Francis Galton 8
Cattell and the early "mental tests" 9
Binet and the rise of intelligence tests 10
Group testing 12
Aptitude testing 13
Standardized achievement tests 16
Measurement of personality 18
Sources of information about tests 20

2. NATURE AND USE OF PSYCHOLOGICAL TESTS
What is a psychological test? 23
Reasons for controlling the use of psychological tests
Test administration 32
Rapport 34
Test anxiety 37
Examiner and situational variables 39
Coaching, practice, and test sophistication 41

3. SOCIAL AND ETHICAL IMPLICATIONS OF TESTING
User qualifications 45
Testing instruments and procedures 47
Protection of privacy 49
Confidentiality 52
Communicating test results 56
Testing and the civil rights of minorities 57

PART 2  PRINCIPLES OF PSYCHOLOGICAL TESTING

4. NORMS AND THE INTERPRETATION OF TEST SCORES
Statistical concepts 68
Developmental norms 73
Within-group norms 77
Relativity of norms 88
Computer utilization in the interpretation of test scores 94
Criterion-referenced testing 96

5. RELIABILITY
The correlation coefficient 104
Types of reliability 110
Reliability of speeded tests 122
Dependence of reliability coefficients on the sample tested 125
Standard error of measurement 127
Reliability of criterion-referenced tests 131

6. VALIDITY: BASIC CONCEPTS
Content validity 134
Criterion-related validity 140
Construct validity 151
Overview 158

7. VALIDITY: MEASUREMENT AND INTERPRETATION
Validity coefficient and error of estimate 163
Test validity and decision theory 167
Moderator variables 177
Combining information from different tests 180
Use of tests for classification decisions 186
Statistical analyses of test bias 191

8. ITEM ANALYSIS
Item difficulty 199
Item validity 206
Internal consistency 215
Item analysis of speeded tests 217
Cross validation 219
Item-group interaction 222

PART 3  TESTS OF GENERAL INTELLECTUAL LEVEL

9. INDIVIDUAL TESTS
Stanford-Binet Intelligence Scale 230
Wechsler Adult Intelligence Scale 245
Wechsler Intelligence Scale for Children 255
Wechsler Preschool and Primary Scale of Intelligence 260

10. TESTS FOR SPECIAL POPULATIONS
Infant and preschool testing 266
Testing the physically handicapped 281
Cross-cultural testing 287

11. GROUP TESTING
Group tests versus individual tests 299
Multilevel batteries 305
Tests for the college level and beyond 318

12. PSYCHOLOGICAL ISSUES IN INTELLIGENCE TESTING
Longitudinal studies of intelligence 327
Intelligence in early childhood 332
Problems in the testing of adult intelligence 337
Problems in cross-cultural testing 343
Nature of intelligence 349

PART 4  TESTS OF SEPARATE ABILITIES

13. MEASURING MULTIPLE APTITUDES
Factor analysis 362
Theories of trait organization 369
Multiple aptitude batteries 378
Measurement of creativity 388

14. EDUCATIONAL TESTING
Achievement tests: their nature and uses 398
General achievement batteries 403
Standardized tests in separate subjects 410
Teacher-made classroom tests 412
Diagnostic and criterion-referenced tests 417
Specialized prognostic tests 423
Assessment in early childhood education 425

15. OCCUPATIONAL TESTING
Validation of industrial tests 435
Short screening tests for industrial personnel 439
Special aptitude tests 442
Testing in the professions 458

16. CLINICAL TESTING
Diagnostic use of intelligence tests 465
Special tests for detecting cognitive dysfunction
Identifying specific learning disabilities 478
Clinical judgment 482
Report writing 487

PART 5  PERSONALITY TESTS

17. SELF-REPORT INVENTORIES
Content validation 494
Empirical criterion keying 496
Factor analysis in test development 506
Personality theory in test development 510
Test-taking attitudes and response sets 515
Situational specificity 521
Evaluation of personality inventories 527

18. MEASURES OF INTERESTS, ATTITUDES, AND VALUES
Interest inventories 528
Opinion and attitude measurement 543
Attitude scales 546
Assessment of values and related variables 552

19. PROJECTIVE TECHNIQUES
Nature of projective techniques 558
Inkblot techniques 559
Thematic Apperception Test and related instruments
Other projective techniques 569
Evaluation of projective techniques 576

20. OTHER ASSESSMENT TECHNIQUES
"Objective" performance tests 588
Situational tests 593
Self-concepts and personal constructs 598
Assessment techniques in behavior modification programs
Observer reports 606
Biographical inventories 614
The assessment of environments 616

APPENDIXES
A. Guidelines on Employee Selection Procedures (EEOC)
B. Guidelines for Reporting Criterion-Related and Content Validity (OFCC)

PART 1

Context of Psychological Testing

CHAPTER 1

Functions and Origins of Psychological Testing

ANYONE reading this book today could undoubtedly illustrate what is meant by a psychological test. It would be easy enough to recall a test the reader himself has taken in school, in college, in the armed services, in the counseling center, or in the personnel office. Or perhaps the reader has served as a subject in an experiment in which standardized tests were employed. This would certainly not have been the case fifty years ago. Psychological testing is a relatively young branch of one of the youngest of the sciences.

CURRENT USES OF PSYCHOLOGICAL TESTS

Basically, the function of psychological tests is to measure differences between individuals or between the reactions of the same individual on different occasions. One of the first problems that stimulated the development of psychological tests was the identification of the mentally retarded. To this day, the detection of intellectual deficiencies remains an important application of certain types of psychological tests. Related clinical uses of tests include the examination of the emotionally disturbed, the delinquent, and other types of behavioral deviants.

A strong impetus to the early development of tests was likewise provided by problems arising in education. At present, schools are among the largest test users. The classification of children with reference to their ability to profit from different types of school instruction, the identification of the intellectually retarded on the one hand and the gifted on the other, the diagnosis of academic failures, the educational and vocational counseling of high school and college students, and the selection of applicants for professional and other special schools are among the many educational uses of tests.

The selection and classification of industrial personnel represent another major application of psychological testing. From the assembly-line operator or filing clerk to top management, there is scarcely a type of job for which some kind of psychological test has not proved helpful in such matters as hiring, job assignment, transfer, promotion, or termination. To be sure, the effective employment of tests in many of these situations, especially in connection with high-level jobs, usually requires that the tests be used as an adjunct to skillful interviewing, so that test scores may be properly interpreted in the light of other background information about the individual. Nevertheless, testing constitutes an important part of the total personnel program. A closely related application of psychological testing is to be found in the selection and classification of military personnel. From simple beginnings in World War I, the scope and variety of psychological tests employed in military situations underwent a phenomenal increase during World War II. Subsequently, research on test development has been continuing on a large scale in all branches of the armed services.

The use of tests in counseling has gradually broadened from a narrowly defined guidance regarding educational and vocational plans to an involvement with all aspects of the person's life. Emotional well-being and effective interpersonal relations have become increasingly prominent objectives of counseling. There is growing emphasis, too, on the use of tests to enhance self-understanding and personal development. Within this framework, test scores are part of the information given to the individual as aids to his own decision-making processes.

It is clearly evident that psychological tests are currently being employed in the solution of a wide range of practical problems. One should not, however, lose sight of the fact that such tests are also serving important functions in basic research. Nearly all problems in differential psychology, for example, require testing procedures as a means of gathering data. As illustrations, reference may be made to studies on the nature and extent of individual differences, the identification of psychological traits, the measurement of group differences, and the investigation of biological and cultural factors associated with behavioral differences. For all such areas of research, and for many others, the precise measurement of individual differences made possible by well-constructed tests is an essential prerequisite. Similarly, psychological tests provide standardized tools for investigating such varied problems as life-span developmental changes within the individual, the relative effectiveness of different educational procedures, the outcomes of psychotherapy, the impact of community programs, and the influence of noise on performance.

From the many different uses of psychological tests, it follows that some knowledge of such tests is needed for an adequate understanding of most fields of contemporary psychology. It is primarily with this end in view that the present book has been prepared. The book is not designed to make the individual either a skilled examiner and test administrator or an expert on test construction. It is directed, not to the test specialist, but to the general student of psychology. Some acquaintance with the leading current tests is necessary in order to understand references to the use of such tests in the psychological literature. And a proper evaluation and interpretation of test results must ultimately rest on a knowledge of how the tests were constructed, what they can be expected to accomplish, and what are their peculiar limitations. Today a familiarity with tests is required, not only by those who give or construct tests, but by the general psychologist as well.

A brief overview of the historical antecedents and origins of psychological testing will provide perspective and should aid in the understanding of present-day tests. The direction in which contemporary psychological testing has been progressing can be clarified when considered in the light of the precursors of such tests. The special limitations as well as the advantages that characterize current tests likewise become more intelligible when viewed against the background in which they originated.

The roots of testing are lost in antiquity. DuBois (1966) gives a provocative and entertaining account of the system of civil service examinations prevailing in the Chinese empire for some three thousand years. Among the ancient Greeks, testing was an established adjunct to the educational process. Tests were used to assess the mastery of physical as well as intellectual skills. The Socratic method of teaching, with its interweaving of testing and teaching, has much in common with today's programmed learning. From their beginnings in the middle ages, European universities relied on formal examinations in awarding degrees and honors. To identify the major developments that shaped contemporary testing, however, we need go no farther than the nineteenth century. It is to these developments that we now turn.[1]

[1] A more detailed account of the early origins of psychological tests can be found in Goodenough (1949) and J. Peterson (1926). See also Boring (1950) and Murphy and Kovach (1972) for more general background, Anastasi (1965) for historical antecedents of the study of individual differences, and DuBois (1970) for a brief but comprehensive history of psychological testing.

EARLY INTEREST IN CLASSIFICATION AND TRAINING OF THE MENTALLY RETARDED

The nineteenth century witnessed a strong awakening of interest in the humane treatment of the mentally retarded and the insane. Prior to that time, neglect, ridicule, and even torture had been the common lot of these unfortunates.
serving important functions in basic research Nearly all problems in differential psychology. reference may be made to studies on the nature and extent of individual differences. for example.:ts he used as an adjunct to s -i u interviewing. It is to these developments that we now turn. Emotional wellbeing and effective interpersonal relations have become increasingly prominent objectives of counseling. and what are their peculiar limitations. and . the impact of community programs. testing constitutes an important part ~ total personnel program. the identification of psychological traits. Pefers~n (1926~. but to the general student of psychology. The book is not designed to EARLY INTEREST IN CLASSIFICATION AND TRAINING OF THE MENTALLY RETARDED The nineteenth century witnessed a strong awakening of interest in the humane treatment of the mentally retarded and the insane. there is scarcely a type of job for which some kind of psychological test has not proved helpful in such matters as hiring. es eciiill-"Tri('Onnection with high-level jobs. ridicule. research on test development has been continuing on a large scale in all branches of the armed services. require testing procedures as a means of gathering data. The roots of testing are lost in antiquity. psycholOgical tests provide standardized tools for investigating such varied problems as life-span developmental changes within the individual. however. but by the general psychologist as well. test scores are part of the information given to the individual as aids to his own decision-making processes. 'the Socratic method of teaching.asurement of individual differences made possible by well-constructed tests is an essential prerequisite. the scope and variety of psychological tests employed in military sihlations underwent a phenomenal increase during World War II. A closely related application of psychological testing is to be found in the selection and classification of military personnel. or termination. From the many different uses of psychological tests. For all such areas of research-and for many others-the precise mt>. With the growing concern for the proper care of mental I A more detlliled account of the early origins of psycllOlogical tests can be found in Goodenough (1949) and J.

In the effort to develop some system for claSSifying the different degrees and varieties of retardation"Esguiroi tried several procedures but concluded that the individual's use of language provides the m05t de endable criterion of his intellectual level. of which more will be said Jal'er. Esquirol also pointed out that there an! many degrees of mental retardation. This emphasis on sen~ory phenome~a was in tU!'l1reflected in the nature of the £rst psychologICal tests.. as in many other phases of their work. where many of the early experimental psychologists received their training. the founoers of experimental psychology reBected the influence of their backgrounds in physiology and physics. First it was necessary to differentiate between the insane and the mentallv retarded. What is probably the first explicit statement of this distinction is to be found in a two-volume work published in 1838 by the French physician Esquirol (1838). the fact that one individual reacted diHerently from another when observed under identical co~ditions was regarded' as a form of -etror. With his fellow members of the Society for the Psychological Study of the Child.:es. The establishment of many special institutions for the care of the mentally retarded in both Europe and America made the need for setting up admission standards and an objective system of classification especially urgent.. in which the individual is required to insert variously shaped blocks into the corresponding recesses as quickly as possible.s~ndardiz~& conditions was . Binet stimulated the Ministry of Public Instruction to take steps to improve the condition of retarded children. This appointment was a momentous event in the history of psychological testing.tal:6hed the nrst school devoted to the education of mentally reta . the bnghtness or color oEthe sUtr~. Man~. auditory. H. This was the attitude toward individual differences that prevailed in such laborotodes as that founded by '''undt at Leipzig in 1879. The presence of such error. concerned \vith the measurement of individual'differences. . The important part verbal ability plays in our concept of intelligence will be repeatedly demonstrated in subsequent chapters. or individual variability. For example.dure eventually became one of the special earmarks of psychological tests.. Thus. who pioneered in the training of the mentally retarded. be assigned to special classes (T.. Individual differences were either ignored or were accepted as a necessary evil that limited the applicability of the generalizations.~:ding field could mar~edly alter the appearance of a visu~J s~mulU~:". It was the uniformities rather than the differences in behavior that were the focus of attention. rendered the generalizations approximate rather than exact.ase or decrease the speeg 'i\t the subject's response. The former manifested emotional disorders that might or might not be accompanied by intellectual deteriomtion from an initially normal level. . varying along a continuum from normality to low-grade idiOCy. and in 1837 he. the French psychologist Alfred Binet urged that children who failed to respond to normal schooling be examined before dismissal and. Or agam.1\h~portance of makmg observations on all subjects un4i~. where his ideas gaine _ ide recognition. if considered educable. By these methods..The earlv ps~'chological experiments brought out the need for rigorous control of the conditions under which observations were made. More than half a century after the work of Esquirol and Seguin. 
the \\'?rding of directions given to the subject in a reaction-time experiment mIght appreci~bly incre.!fu1svividly demonstrated: Such standardization of proce. S. St:ilI another way in which nineteenth-century experimental psychology Influenced the course of the testing movement may be noted. 1973). In their choice of topics. in which over one hundred pages are de\'oted to mental retardation. Of special significance are the contributions of another French physician. severely retarded children are given intensive exercise in sensory discrimination and in the development of motor control. SeO'uin (1866) eXIJerimented for many " vears with what he v ~ termed the physiological method of training. Having rejected the prevalent notion of the ineurability of mental retardation . An example is the Seguin Form Board. . to which Binet was appointed.of the sense-training and muscle-trainirJg techniques currently in use in institutions for the mentally retarded \vere originated by Seguin. In 1848 he emigrated to America.6 Context of Psychological Testing Functions and Origins of Psychological Testing 7 deviates came a realization that some uniform criteria for identifying and classifying these cases were required. A specific outcome was the <'stablishment of a ministerial commission for the study of retarded children. It is meres mg to note t at current criteria 0 menta retardation are also largely lingUistic ant! that present-day intelligence tests are heavily loaded ~vith Yerbal content. in general. the latter were characterized essentially by i~tellectual defect that had been present from birth or early infancy. Some of the procedures developed by Seguin for this purpose were 'eventually incorporated into performance or nonverbal tests of intelligence. and~ other sensory stimuli and \vith simple reaction time. The principal aim of psychologists of that period was the fommlation of generalized descriptions of human behavior." ~hildren. The ~arly experimental psycholOgists of the nineteenth century were not. as will be apparent in subsequent sections. Wolf.egll~. The problems studied in their laboratories were concerned largely with sensitivit~ to ~al.

CONTRIBUTIONS OF FRANCIS GALTON

It was the English biologist Sir Francis Galton who was primarily responsible for launching the testing movement. A unifying factor in Galton's numerous and varied research activities was his interest in human heredity. In the course of his investigations on heredity, Galton realized the need for measuring the characteristics of related and unrelated persons. Only in this way could he discover, for example, the exact degree of resemblance between parents and offspring, brothers and sisters, cousins, or twins. With this end in view, Galton was instrumental in inducing a number of educational institutions to keep systematic anthropometric records on their students. He also set up an anthropometric laboratory at the International Exposition of 1884 where, by paying threepence, visitors could be measured in certain physical traits and could take tests of keenness of vision and hearing, muscular strength, reaction time, and other simple sensorimotor functions. When the exposition closed, the laboratory was transferred to South Kensington Museum, London, where it operated for six years. By such methods, the first large, systematic body of data on individual differences in simple psychological processes was gradually accumulated.

Galton himself devised most of the simple tests administered at his anthropometric laboratory, many of which are still familiar either in their original or in modified forms. Examples include the Galton bar for visual discrimination of length, the Galton whistle for determining the highest audible pitch, and graduated series of weights for measuring kinesthetic discrimination. It was Galton's belief that tests of sensory discrimination could serve as a means of gauging a person's intellect. In this respect, he was partly influenced by the theories of Locke. Thus Galton wrote: "The only information that reaches us concerning outward events appears to pass through the avenue of our senses; and the more perceptive the senses are of difference, the larger is the field upon which our judgment and intelligence can act" (Galton, 1883, p. 27). Galton had also noted that idiots tend to be defective in the ability to discriminate heat, cold, and pain, an observation that further strengthened his conviction that sensory discriminative capacity "would on the whole be highest among the intellectually ablest" (Galton, 1883, p. 29).

Galton also pioneered in the application of rating-scale and questionnaire methods, as well as in the use of the free association technique subsequently employed for a wide variety of purposes. A further contribution of Galton is to be found in his development of statistical methods for the analysis of data on individual differences. Galton selected and adapted a number of techniques previously derived by mathematicians. These techniques he put in such form as to permit their use by the mathematically untrained investigator who might wish to treat test results quantitatively. He thereby extended enormously the application of statistical procedures to the analysis of test data. This phase of Galton's work has been carried forward by many of his students, the most eminent of whom was Karl Pearson.

CATTELL AND THE EARLY "MENTAL TESTS"

An especially prominent position in the development of psychological testing is occupied by the American psychologist James McKeen Cattell. The newly established science of experimental psychology and the still newer testing movement merged in Cattell's work. For his doctorate at Leipzig, he completed a dissertation on individual differences in reaction time, despite Wundt's resistance to this type of investigation. While lecturing at Cambridge in 1888, Cattell's own interest in the measurement of individual differences was reinforced by contact with Galton. On his return to America, Cattell was active both in the establishment of laboratories for experimental psychology and in the spread of the testing movement.

In an article written by Cattell in 1890, the term "mental test" was used for the first time in the psychological literature. This article described a series of tests that were being administered annually to college students in the effort to determine their intellectual level. The tests, which had to be administered individually, included measures of muscular strength, speed of movement, sensitivity to pain, keenness of vision and of hearing, weight discrimination, reaction time, memory span, and the like. In his choice of tests, Cattell shared Galton's view that a measure of intellectual functions could be obtained through tests of sensory discrimination and reaction time. Cattell's preference for such tests was also bolstered by the fact that simple functions could be measured with precision and accuracy, whereas the development of objective measures for the more complex functions seemed at that time a well-nigh hopeless task.

Cattell's tests were typical of those to be found in a number of test series developed during the last decade of the nineteenth century. Such test series were administered to schoolchildren, college students, and miscellaneous adults. At the Columbian Exposition held in Chicago in 1893, Jastrow set up an exhibit at which visitors were invited to take tests of sensory, motor, and simple perceptual processes and to compare their skill with the norms (J. Peterson, 1926; Philippe, 1894). A few attempts to evaluate such early tests yielded very discouraging results. The individual's performance showed little correspondence from one test to another (Sharp, 1898-1899; Wissler, 1901), and it exhibited little or no relation to independent estimates of intellectual level based on teachers' ratings (Bolton, 1891-1892; Gilbert, 1894) or academic grades (Wissler, 1901).
An especially prominent position in the development of psychological testing is occupied by the American psychologist James McKeen Cattell.U4-~e. 1. many of which are still familiar either in ~heir original or in modified forms.~~ c~pination and reaction time. keenness of vision and of hearing.:~lso noted that idiots tend to be defective in the ability to discrlmmaJe·:heat. Calton was mstrument~l ' in inducing a number of educational institutions to keep systematic anthropometric recOl:ds on their students.:as. the laboratory was transferred to South Kensmgton Museum.e~ed by the fact that simple functions could be measured with . 1901). Galton selected and adapted a n~mber of techniques previously derived ~y m~thematicians. In the course of his imestigations on heredity. used for the £rst time in the psychological literature.rther contribution of Galton is to be found in hiS development of statistical methods for the analysis of data on individual differences. Only in this way could he discover. Galton also pioneered in the application of rating-sca~c ~nd ques~lOnnaire methods as well as in the use of the free associatIon techmque subsequently ~mployed for a wide ~arietyof purposes.~ _ In an article written by Cattell in . and other simple sensorimotor functions. With this end 11l View. sensiti~ty to pain. 27).890. motor.\'. Whe~l the exposition closed..'rothers and .ablest" (Galton. visitors could be measured 111 ce~yslcal traIts and could take tests of keenness of vision and hearing. ' .establishment of laboratories for experimental psychology and in the spread of the testing movement. The newly established science of experimental psychology and the still newer testing movement merged in Cattelfs work. muscular strength. weight discrimination. systematic body of data on individual differences in simple psychological processes was gradually aceu~ulated.. Cattell shared Galton's view that Jl measure of/M-.M ". .<-lA. Jastraw set up an exhibit at which visitors wete"'iIllitted to take tests of sensory. and simple perceptual processes and: to compare tlieir skill with the norms (J. These techniques he put in such form as to permit theIr use by the mathematically untrained investigator who might wish to treat test results quantitatively. Wissler. included measures of muscular strength. and it exhibited little or no . London..V1.

Oehrn (1889). however. was the only one that showed a clear correspondence with the children's scholastic achievement. Kraepelin (1895). 1894) or academic grades (Wissler. some unsatisfactory tests from the earlier scale were eliminated. 1896). consisted of 30 problems or tests arranged in ascending order of difficulty. attention. and hand form. 1891-1892. \\Tolf. suggestibility.nt. The 1905 scale was presented as a preliminary and tentative instrument. because the scientific ~Om!J1l1nity was not ready for it. and the analysis of handwriting. Gilbert. with speCial emphasis onJ. the Binet-Simon tests attracted wide > Goodenough (1949. H. A. 1973). and no precise objective method for arriving at a total score was formulated. facial. ously cited commission to study procedures for the education of retarded children. memory span. Thus. In the various translations and adaptations of the Binet scales. and the scale was extended to the adult level Even prior to the 1908 revision. a pupil of Kraepelin. Which Binet regarded as essential components of intelligence. 1901). and motor functions in an investigation on the interrelations of psychological functions. and all tests were grouped into age levels on the basis of the performance of about 300 normal children between. pp. covering such functions as memory. Binet's own scale was in~ed by the work oE some oE ~is contemporaries. who was interested primarily in the clinical examination of psychiatric patients. . The tests. were designed to measure practice effects. the Italian psychologist Ferrari and his students were interested primarily in the use of tests with pathological cases (Guicciardi & Ferrari. S. This scale. In these tests we can recognize the trends that were eventually to lead to the development of the famous Binet intelligence scales. however. In this scale. perhaps. and many others. and to some mentally retarded children and adults. known as the 1905 seale. the ages of 3 and 13 Years. and sentence completion to schoolchildren. prepared the first Binet-Simon Scale (Binet & Simon. H. the Minister of Public Instruction appointed ~inet to the previ- .. Many approaches were tried. since individual differences are larger in these functions. Binet and Henri criticized most of the available test series as being too largely sensory and as concentrating unduly on simple. Wolf. 2l y~aTs befor~ the appearance of the 1908 Binet-Simon Scale. In an article published in France in 1895. administered tests of arithmetic computation. even though crude. comprehension. the year of Binet's untimely death. The difficulty level was determined empirically by administering the tests to 50 normal children aged 3 to 11 years. In the second. The tests were designed to cover a wide variety of functions. E. Partly because' of the limited circulation of the journal 'nd partly. no fundamental changes were introduced. comprehension. 50-51) notes that in 1881. Ebbinghaus (1897). had emploY€id tests of perception. memory. aesthetic appreciation. notably Blin and Damaye. who prepared a set of oral questions from which they derived a single global score Eor eaclrdiild (T. Another German psychologist. A few years earlier. The test series they devised ranged from physiological measures and motor tests to apprehension span and the interpretation of pictures. specialized abilities." Since mental age is such a simple concept to~rasE> the introduction of this term undoubtedly did much to popularize intelligence testing. great precision is not necessary. or 1908. J. 
the number of tests was increased. Although sensory and perceptual tests were included.> Binet himself. 1973). and susceptibility to fatigue and to distraction. sentence completion.udgmt. Like Kraepelin. A third revision of the Binet-Simon Scale appeared in 1911. scale. imagination. a much greater proportion of verbal content was found in this scale than in most test series of the time. Minor revisions and relocations of specific tests were instituted. in the 3-year level were placed all tests passed by 80 to 00 percent of normal 3-year-olds. the significance of this age-scale concept passed unnoticed at the time. all tests similarly passed by normal 4-yearolds. T en a specific situation arose that brought Binet's efforts to imme(]iate practical fruition. avoided the term "mental age" because of its unverified developmental implications and preferred the more neutral term "mental level" (T. association. The results. the term "mental age" was commonly substituted for "mentalleveI. employing chiefly simple arithmetic operations. More tests were added at several year levels. An extensive and varied list of tests was proposed. prepared a long series of tests to measure what he regarded as basic factors in the characterization of an individual. in the measurement of the more complex functions. Chaille publi!iheq in the New Orleans Medical a~d Surgical Journal a series of tests for infan~ 11l7anged according to the a!1:eat whIch the tests are commonly passed. They argued further that. The child's score on the entire test could then be expressed as a mental level corresponding to the age of normal children whose performance he equaled.ls of Psychological Testing 11 relation to independent estimates of intellectual levC:'1 ased on teachers' b ratings (Bolton. and reasoning. In 1904. memory. 1905). including even the measurement of cranial. led to a growing conviction that the direct. in the 4-year-Ievel. A number of test series assembled by European psychologists of the period tended to cover somewhat more complex functions. It was in connection 'with the objectives of this commission that Binet. The most complex of the three tests. in collaboration with Simon.10 Context of PSlJc11010gical Testing Functions and Origi. Binet and his co-workers devoted many years to active and ingenious research on ways of measuring intelligence. measurement of com lex 1 fence a unc ons 0 ere t e greatest promise. and so on to age 13.

Translations and adaptations appeared in many languages. In America, a number of different revisions were prepared, the most famous of which is the one developed under the direction of L. M. Terman at Stanford University, and known as the Stanford-Binet (Terman, 1916). It was in this test that the intelligence quotient (IQ), or ratio between mental age and chronological age, was first used. The latest revision of this test is widely employed today and will be more fully considered in Chapter 9. Of special interest, too, is the first Kuhlmann-Binet revision, which extended the scale downward to the age level of 3 months (Kuhlmann, 1912). This scale represents one of the earliest efforts to develop preschool and infant tests of intelligence.
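The computation involved is simple. In its conventional form, the ratio IQ is

    IQ = 100 x MA / CA

where MA is the mental age earned on the scale and CA is the chronological age, both expressed in the same units. For example, a child of chronological age 8 who earns a mental age of 10 obtains an IQ of 100 x 10/8, or 125, whereas equal mental and chronological ages yield the average value of 100.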
Translation~ and adaptations appeared in many lang. Man\' of the tests in these scales require . Coll~ge studen~s were routinely examined prio~ to admission.oral re~ponses from the subject or n~cessitate the manipulation of materials.rime importance in our culture. Some call for individual timing of responses. and especially on an unpublished group intelligence test prepared by ~rthur S. based on the indiscriminate use of tests i? ma~ have ~one as much to retai' as to ad\'ance the progress of psvcho. they not only permItted the simultaneous examination of large groups but also simplified the instructions and adminish'ation procedu~es so as to demand a minimum of training on the part of the exammer. the Ar-m~' psychologists drew on all available test materials. That the tests were still crude instruments was often f?rgotten in the rush of gathering scores and drawing practical conduslO~Sfrom the ~esults. t~e latter was a nonlanguage scale employed WIth Illiterates and wIth foreign-born recruits who were unable to take a tcst in English. "--T~e application of such group intelligence tests far outran their technical Improvement. The te~ting . suited to the intensive study of individual J . recog. such tests are not adapted to group administration.lUlq be prefer- . which hc turned over to the Army. ~t Terman a. is the first Kuhlmann-Binet revision. Another characteristic of the Binet type of test is that" it requires a highly trained examiner. Shortly af~e~ the temunatlOn of "Vorld War I. Extensive studies of specIal adult groups. The latest revision of this test is widely employed today and will be more full\' considered in Chapter 9.•. 'Vhen. are indil. was developed to meet a pressing practical need. under the direction of !lobert 1 1. Otis.red. B~ It was.uages. since only certain aspects of mtelligence were measured by such tests. Both test~ w~re suitable for administratio~ to large groups. w<.

This shift ill terminology was made in l'ec:ognition of the fact that mallY so-called intelligence tests measure that combination of abilities demanded by academic work. . Thus.psychologist Charles Spearman (1904. Among the most widely used are tests of.\yelikewise ~en 4..jis'a result. _ ' vocationa counseling and in the selection and classification of industrial and military ersonn~1. 1927) during the £lrst decade of the present century. for example. For example. Statistical studi('s on the nature of intelligence had been explonng the iflterrelatiol1s among scores obtained by many persons on a . A.n educati0l1~l and vocational counselmg and in personnel' selectioll and' cJassincadqIl. For the present. In the Air Force.~h.~. wlth crude and often errODl:'OUSesults from intelligence tests. it will suffice to note that the data gathered by such procedures have indicated the presence of a Dumber of rebtiyely . or vice versa. all items involving words might prove difficult for a particular individual. such as spatial. in vary~ng proportions. and perce~tual speed. a point of terminology shoul\!l be clarified. For example. u tip e ap u e atteries represent a relatively late development in the testing field. radio operators. Some of these traits were represen'ted. Such investigations were begun by the English . Examples of such butteries will be discussed in Chapter 13. A report of the batterics prepared in the Air Force alone occupies at least nine of the nineteen volumes devoted to the aviation psychology program during 'Vorld War II (Anny Air Forces. use and are being widely appliel:l\. This . Such a practice is not to be general1~' recommended. E\'l'n prior to Vvorld War I. Subsequent methodological developments. Nearl~' all have appeared since 1945. While the practical apl)lication of tests demonstrated the l1~. ~fuch of the test research conducted in the armed services was based on factor analysis and was directed toward the construction of mu. frequently utilized such interc~l11parisons in order to obtain 1110re insight into the individual's psychological make-up. t at c inicians a een tr\'ing for matiy years to .~~mber of multiple aptitude batteries !rl. as well as on that of other American and English ll1veshgators.ltiple aptitude batteries. the obtained diffl:'rence betwcen subtest scores might be reversed if the individual were retestE'd on a different day or with another foml of the same test. Verbal comprehenSIOn and numerical reasoning are examples of this tvpe of trait.ide variety of different tests. special battent's were constructed for pilots. for example. If such intraindividual comparisons are to be made. since the multiple aptitude batteries cover some of the traits not ordinarily me u e JlI IJ1e 1 ence tests. or traits. 111 whlch the items ar~mmonly segregated into subtests of relath'e1\.14C\\' items to yield a stable or reliable estimate of a specific ability:.J~d also be noted. 194i). such internal variability is also discernible on a test like the Stanford-Binet. a number of tests that would probably have been caned intelligence tests during the twenties later came to be known as scholastic aptitude tests.homogeneous content.14 Context of Psyclwlo{!. ho. ps\'ch~logists had begun to recognize the need for tests of spE'cial aptitudes to suppkment the global intelligence tests. and mechanical aptitude~. 
tests are needed that are specially designed to reveal differences in performance in various functions.ndeJ)endent factors.ed for differential aptitude tests.ation formerly obtained fro~l special aptihlde tl:'sts.!!lechaniea . lIote"iOlthy fact: an individual's erformance on ' test often -showed mar -c variation. have come to be known as "factor analvsis.. L. a parallel development in the stu. arithm~tic re~soning.d)' of trait organization was gradually providing the means for constructing SUC? tests. ~)eeaus~ intellig('J]ce tests were not designed for the purpose of .'sis. numerical aptitude. in which. spatial visualization. ReIley (1928) and L. in the traditional intelligence tests. Others. The . In this connection. were found more often in special aptitude tests than in intelligence tests. Research along these line~ is still in progress under the sponsorship of various branches of the armed services. In place of a total score or IQ. . whereas itcms employing pictures or geometric diagrams may place him at an advantage. Such batteries thus provide a SUItable mstrument for makin<1 the kind of intraindividual anaJ\'Sis I' 1 e~'e ~nOSls. c erica. These batteri('s arc desiuned to provide a measure of the individual's standing in each of a number of traits. One of the chief practical outcomes of factor analysis was the development of multiple aptitude batteries. -TI~ca~lation of intelligence tests that follm. These s ecial a till/de tests ' .'ed their widesl>\'eadand indiscriminate use durinlJ0 the twenties also revealed another . L. . Test users. based on the work of such American psychologists as T. and especially clinicians."-' " To avoid confusion." The contributions that the methods of factor ana'lysis have made to test c'Onstruction will be more fully examined and ill~strated in Chapter 1:3. !hurs~one (1935.eveloped for clVllian. Often the subtests heing compared contain t0o. and scores of other military specialists. the work of thc military psychologists during World War II s. 1947-1948). not only tllC'IQ or other global score but also scores on subtests wonld lJt' examined in the e\'aluation of the indhidual case.('ver. r These batteries also incorporate into a comprehensivl:' and svstl:'matic testing program much of the inform. musical. To some extent.obtam.dIHerel. perceptual. bombardiers. and artistic aptitlldes. a separate score is obtained for such traits as "erhal comprehension. range finders. a person might score relatively high on a verbal subtest and low on a numerical subtest.~11 aphtude anal.ical Testing Functions and OrigillS of PSljchological Testing 15 able.yas especially apparent on gl'OUp tests.

STANDARDIZED ACHIEVEMENT TESTS

While psychologists were busy developing intelligence and aptitude tests, traditional school examinations were undergoing a number of technical improvements (Caldwell & Courtis, 1923; Ebel & Damrin, 1960). An important step in this direction was taken by the Boston public schools in 1845, when written examinations were substituted for the oral interrogation of students by visiting examiners. Commenting on this innovation, Horace Mann cited arguments remarkably similar to those used much later to justify the replacement of essay questions by objective multiple-choice items. The written examinations, Mann noted, put all students in a uniform situation, permitted a wider coverage of content, reduced the chance element in question choice, and eliminated the possibility of favoritism on the examiner's part.

Achievement tests are used not only for educational purposes but also in the selection of applicants for industrial and government jobs. Mention has already been made of the systematic use of civil service examinations in the Chinese empire. In modern times, selection of government employees by examination was introduced in European countries in the late eighteenth and early nineteenth centuries. The United States Civil Service Commission installed competitive examinations as a regular procedure in 1883 (Kavruck, 1956). Test construction techniques developed during and prior to World War I were introduced into the examination program of the United States Civil Service with the appointment of L. J. O'Rourke as director of the newly established research division in 1922.

After the turn of the century, the first standardized tests for measuring the outcomes of school instruction began to appear. Spearheaded by the work of E. L. Thorndike, these tests utilized measurement principles developed in the psychological laboratory. Examples include scales for rating the quality of handwriting and written compositions, as well as tests in spelling, arithmetic computation, and arithmetic reasoning. Still later came the achievement batteries, initiated by the publication of the first edition of the Stanford Achievement Test in 1923. Its authors were three early leaders in test development: Truman L. Kelley, Giles M. Ruch, and Lewis M. Terman. Foreshadowing many characteristics of modern testing, this battery provided comparable measures of performance in different school subjects, evaluated in terms of a single normative group.

Meanwhile, evidence was accumulating regarding the lack of agreement among teachers in grading essay tests. By 1930 it was widely recognized that essay tests were not only more time-consuming for examiners and examinees, but also yielded less reliable results than the "new type" of objective items. As the latter came into increasing use in standardized achievement tests, there was a growing emphasis on the design of items to test the understanding and application of knowledge and other broad educational objectives. The decade of the 1930s also witnessed the introduction of test-scoring machines, for which the new objective tests could be readily adapted.

The establishment of statewide, regional, and national testing programs was another noteworthy parallel development. Probably the best known of these programs is that of the College Entrance Examination Board (CEEB). Established at the turn of the century to reduce duplication in the examining of entering college freshmen, this program has undergone profound changes in its testing procedures and in the number and nature of participating colleges, changes that reflect intervening developments in both testing and education. In 1947, the testing functions of the CEEB were merged with those of the Carnegie Corporation and the American Council on Education to form Educational Testing Service (ETS). In subsequent years, ETS has assumed responsibility for a growing number of testing programs on behalf of universities, professional schools, government agencies, and other institutions. Mention should also be made of the American College Testing Program, established in 1959 to screen applicants to colleges not included in the CEEB program, and of several national testing programs for the selection of highly talented students for scholarship awards.

As more and more psychologists trained in psychometrics participated in the construction of standardized achievement tests, the technical aspects of achievement tests increasingly came to resemble those of intelligence and aptitude tests. Procedures for constructing and evaluating all these tests have much in common. The increasing efforts to prepare achievement tests that would measure the attainment of broad educational goals, as contrasted to the recall of factual minutiae, also made the content of achievement tests resemble more closely that of intelligence tests. Today the difference between these two types of tests is chiefly one of degree of specificity of content and extent to which the test presupposes a designated course of prior instruction.
Established at thc turn of the ce_ll'~' to reduce duplication in the exa"tnining of entering college freshmen. The written examiuations. Procedur~s for cons. As more and more psychologists trained in psychometrics participated m the construction of standardized achievement tests. selection of go\'~rnI~lent emplo:-e~s by examination was introduced in European countnes 111the late eIghteenth and eark nineteenth centuries. and arithmetic reasol1lng. refers to more hderogence-.widely recognized that essay tests were not only more hme-cOnsumll1g for examiners and examinees. one for eaeh aptitude.as another noteworthy parallel denlopment. regional. Spearheaded h~' the work of E. Test construction techniques developed during and prior to World "'a~ I were introduded into tll<:'examination program of the United States Ch-il Service with the appointment of L.trllcting and evaluating all ~hese tcsts have much in common. for which the new ohjec:tive tests could be readily adapted. and of several national testing programs for the selection of highl\' talented students for scholarship awards. ~ests yielding a single global score sm:h as an IQ. Thorndike. Its authors were three earl" It'aders in test development: Truman L. The deeade of the 19:305 also witnessed the introduction of test-seoring maehines.. The l!llited States Chi! Service Commission in~talled competitive examinations as a regular procedure in 1883 (Kanuck. evaluated 111 terms of a smgle norma live group.{ Tcsrillg FI/I1C!iol1. ETS has assumed responsibility for a growing number of testlllg programs on behalf of universities. The incre~s!ng effOlts to prepare achIevement tests that would measure the attainment of broad educational goals. traditional school examinations were undergoing a number of technical improvements (Caldwell & Courtis.c. At the same time. permitted a wider cO\'erage of content. professional schools. the testing functions of the CEEB were llIerged with those of the Carnegie Corporation and the American Council on Education to form Educational Testing Service (ETS). 19.. Ruch.1930 it was. S~)ecial aptitu~c tests typically measure a single aptitude.

19(1). The protntype of tht. thereby reducmg the chances that the subject can dt'li1wrateh. . in a hroader sense.e al.. cspecially among dlll1CIans. served as a model for most subsequent emotional adjustment inventories. Hesearch on the ~~~urement ~f. Lik(' the performancc and situational tests. A more recent illustration. and interpreting pictures or inkblots. although some psychologists prefer to lISt' the term personalit~. This test was designed as a rough screening device for identifying seriously ~urotic men \\'110 would be' unfit for military service.l.J' IIIIC/ /(111. The assumption underlvincr such metllocls is that the indi\'idual will project his characteristic m~d~: of response into stich a task. On the whole. .man) mgemous devIC. :\Iost of these tests s~llIulate e\'eryday-life situations quite c1ose1~'. dlsad\:antages.eh.p'ortions since . The fre(' association technique has subscqllenth' becn utilized for a vari('ty of testing purpos('s and is still curr('nth. moreover. and drugs and concluded that all these agents increase the relati\'{~ frequenc~' of superficial associations. Immediatel" after the war. and pcmstenct'. In thIS test the subject is gh'en specially selectcd stimulus words and is required to r('spond to each with the first word that comes to mind. The Personal Data ~heet was )lot completed carly enough to permit its operational use . is l~ro\'ided by the series of situational tests developcd during l " OJld "ar II 111 the Assessment Program of the Office of Strate<Tic Services (OSS. . Tests d('signed for this purpose are commonly known as personality tests.~'hich the individual answered about himself. A total score was o\5t~ined by counting the number of symptoms reported. 19 Another area of psy<:holo~ical testing is concerned with the aH('ctive or nonint('lIectnal aspects of b('ha\'io!'. In such tests. of J'sydl(l'(/~i('111 1'<'S!iIlt!. including a special form for use with children.All(~th('rapproach to the measurement of personalit~' is through the appllc. too. suggested that the free association test might be used to differentiate between the various forms of mental disorder.so been tlSed in this manner. An earl~' precursor of personaJit~' testing may be r('cognizcd in Kra.tev. personalit\' qnpstionnaire. W('H' based l'ssentialh' on <llll'stionnaire t('chniqul's. 5.lttell in the dpyelopment of standardized questionnaire and ratin~-. Otller tasks commonly employed\n proJech\'e techniques include drawing. proje~ti\'l' techniqucs are mor(' or less disguised in lhl:'ir purpose. is the Per~(lnal Data Sheet developed by \Voodworth durin~ \"orId \Var I (DuBois. Other tests concentrated more intensively on a narrower area of bc-!Ja>ior 01' Wl'I'(' <:olll:erncd with mOl'(' dbtindly social r('~pons('s. also writing: during the last decade of the nineteenth century. 1929. lying. Pro. quantitative scores could he obtained on each of a largc numb('r of sp('cific tests. . howcver. was concerned \:'ith such beha"ior as cheating. interpersonal relations. for the a~1I1. such as home adjustment.J)efore the war cnded. ' ~nd . ('xtempor~nt'ous dramatic play. 1970. cooperatin'ness. both practi~al and theoretical. These tests wem' C:Oll('erned with rclath·ely ~omplex and subtle sodal and emotional beha\'ior and refluired rather ehlborate f~cilities and tr~lin:d personnel for their admillistration. \Iention should also be made of the 'York of Galton. The \Voodworth Personal Data Sheet. These tests. Objective. Sellten('e-completion tests hav.en\plcn'ed. moth·ation. 
these proc-edmes \wre e\'entual1~' employed by othNs in constructing some of the most common types of current personality tests.nall.'csand techmcal J1nprovemeil~s arc under ~VeStigabon. and their associates (1928. Each approach has its own special advaqtages and. Intellectual as well as nonintellectual traits . Symonds. In some of these questionnaires.r the spt'cial difficulti~ encountel:fd in the easurement of personality that account for the slow advances in this u~ . But such lack of progress is not to be attributed to insufficient eHOI't. 19-48). The inventor\' conslst~d of a number of questions dealing with common neurotic sy~pt01'!lS. personality testing has lagged far behmd aptitude t('sting in its positive accomplishments. or self-report inventory.e subject s responses. pers?nality ~as attained i~pr~s~ive Pl~p. A later development \\'as th<: constmction of tests for quantifying the expression of interests and athtude's. stealing.~'ale tl'chniqn('s. the subject is gi\'en a relatin'Jy unstructured task that permits "'ide latitudl' in its solution. and attitudes. The interpretatIOn of th. school adjustment."ould thus be included under this heading. \\'as rdati\'C I~' suhjectivc. to refer to the cntirc individual. and vocational adjustment. .tyand olle that has shown phenomenal gro\vth. 19:30). 19:31. however. standardized on s'choolchildren. The prc\'iously cited free association test'represe. Sommer (1894). moreover.\hon of performatlce or situational tests. ~fa\'.ectll. All. interests.:pelin's use of the free association test with abnormal patients. hunger.e techniqlles represent a third approach to the study of persO. the designation "personality test" most often refers to measures of such characteristics as emotional adjustment. the subject has a task to perform whose purpose is often disgUised. civilian forms were prepared. This series. In such tests.Th(' first extensive applicatIOn o~ such tl'chniqnes is to be found in the h'sts de\'eloped in the late twenhcs and earl~' thirties by Hartshorne. It is rathe.a\'aJlable types of personality t('sts present serious difficulties. Kraepelin ( 1892) also employed this technique to study the psychological effects of fatigue. Goldlwrg.\ {///(/ (higill.nts one of thc earlIest types of projccth'e techniques. In the terminology of psychologit·al testing. arranging toys to create a scene. an attempt was made to subdivide emotional adjustment into more specific forms. Although origin~l1y devised for other purposes. such as dOl1lmalll'C-sublllission in interpersonal ('ontacts.. Pear~on: and C.create a desired impressi?l1.

SOURCES OF INFORMATION ABOUT TESTS

Psychological testing is in a state of rapid change. There are shifting orientations, a constant stream of new tests, revised forms of old tests, and additional data that may refine or alter the interpretation of scores on existing tests. The accelerating rate of change, together with the vast number of available tests, makes it impracticable to survey specific tests in any single text. In order to keep abreast of current developments, anyone working with tests needs to be familiar with more direct sources of contemporary information about tests.

One of the most important sources is the series of Mental Measurements Yearbooks (MMY) edited by Buros (1972). These yearbooks cover nearly all commercially available psychological, educational, and vocational tests published in English. The earliest publications in this series were merely bibliographies of tests; beginning in 1938, the yearbook assumed its current form, which includes critical reviews of most of the tests by one or more test experts, as well as a complete list of published references pertaining to each test. Routine information regarding publisher, price, forms, and age of subjects for whom the test is suitable is also regularly given. Each yearbook includes tests published during a specified period, thus supplementing rather than supplanting the earlier yearbooks. The Seventh Mental Measurements Yearbook, for example, is concerned principally with tests appearing between 1964 and 1970. Tests of continuing interest, however, may be reviewed repeatedly in successive yearbooks, as new data accumulate from pertinent research.

A comprehensive bibliography covering all types of published tests available in English-speaking countries is provided by Tests in Print (Buros, 1974). Two related sources are Reading Tests and Reviews (Buros, 1968) and Personality Tests and Reviews (Buros, 1970). Both include a number of tests not found in any volume of the MMY, as well as master indexes that facilitate the location of tests in the MMY.

Since 1970 several sourcebooks have appeared which provide information about unpublished or little-known instruments, largely supplementing the material listed in the MMY. A comprehensive survey of such instruments can be found in A Sourcebook for Mental Health Measures (Comrey, Backer, & Glaser, 1973). Containing approximately 1,100 abstracts, this sourcebook covers only instruments not listed in the MMY and includes tests, questionnaires, rating scales, and other devices for assessing both aptitude and personality variables in adults and children. The entries were located through a search of 26 measurement-related journals for the years 1960 to 1970. Another similar reference is entitled Measures for Psychological Assessment (Chun, Cobb, & French, 1975); this handbook describes instruments located through an intensive journal search spanning a ten-year period. For each of 3,000 measures, the volume gives the original source as well as an annotated bibliography of the studies in which the measure was subsequently used.

Information on assessment devices suitable for children from birth to 12 years is summarized in Tests and Measurements in Child Development: A Handbook (Johnson & Bommarito, 1971). Selection criteria included availability of the test to professionals, adequate instructions for administration and scoring, sufficient length, and convenience of use (e.g., paper-and-pencil tests not requiring expensive or elaborate equipment). A still more specialized collection covers measures of social and emotional development applicable to children between the ages of 3 and 6 years (Walker, 1973).

More intensive coverage of testing instruments and problems in special areas can be found in books dealing with the use of tests in such fields as counseling, clinical practice, personnel selection, and education. References to such publications are given in the appropriate chapters of this book. Reviews of specific tests are also published in several psychological and educational journals, such as the Journal of Educational Measurement and the Journal of Counseling Psychology.

Finally, it should be noted that the most direct source of information regarding specific current tests is provided by the catalogues of test publishers and by the manual that accompanies each test. A comprehensive list of test publishers, with addresses, can be found in the latest Mental Measurements Yearbook. For ready reference, the names and addresses of some of the larger American publishers and distributors of psychological tests are given in Appendix D. Catalogues of current tests can be obtained from each of these publishers on request, and manuals and specimen sets of tests can be purchased by qualified users.

The test manual should provide the essential information required for administering, scoring, and evaluating a particular test. In it should be found full and detailed instructions, scoring key, norms, and data on reliability and validity. Moreover, the manual should report the number and nature of the subjects on whom norms, reliability, and validity were established, the methods employed in computing indices of reliability and validity, and the specific criteria against which validity was checked. In the event that the necessary information is too lengthy to fit conveniently into the manual, references to the printed sources in which such information can be readily located should be given. The manual should, in short, enable the test user to evaluate the test before choosing it for his specific purpose. It might be added that many test manuals still fall short of this goal. But some of the larger and more professionally oriented test publishers are giving increasing attention to the preparation of manuals that meet adequate scientific standards.

A succinct but comprehensive guide for the evaluation of psychological tests is to be found in Standards for Educational and Psychological Tests (1974), published by the American Psychological Association. These standards represent a summary of recommended practices in test construction, based on the current state of knowledge in the field. They are concerned with the information about validity, reliability, norms, and other test characteristics that ought to be reported in the manual. In their latest revision, the Standards also provide a guide for the proper use of tests and for the correct interpretation and application of test results. Relevant portions of the Standards will be cited in the following chapters, in connection with the appropriate topics. An enlightened body of test users provides the firmest assurance that such standards will be maintained and improved in the future.

CHAPTER 2

Nature and Use of Psychological Tests

The historical introduction in Chapter 1 has already suggested some of the many uses of psychological tests, as well as the wide diversity of available tests. Although the general public may still associate psychological tests most closely with "IQ tests" and with tests designed to detect emotional disorders, these tests represent only a small proportion of the available types of instruments. The major categories of psychological tests will be discussed and illustrated in Parts 3, 4, and 5, which cover tests of general intellectual level, traditionally called intelligence tests; tests of separate abilities, including multiple aptitude batteries, tests of special aptitudes, and achievement tests; and personality tests, concerned with measures of emotional and motivational traits, interpersonal behavior, interests, attitudes, and other noncognitive characteristics.

In the face of such diversity in nature and purpose, what are the common differentiating characteristics of psychological tests? How do psychological tests differ from other methods of gathering information about individuals? The answer is to be found in certain fundamental features of both the construction and use of tests. It is with these features that the present chapter is concerned.

BEHAVIOR SAMPLE. A psychological test is essentially an objective and standardized measure of a sample of behavior. Psychological tests are like tests in any other science, insofar as observations are made on a small but carefully chosen sample of an individual's behavior. In this respect, the psychologist proceeds in much the same way as the chemist who tests a patient's blood or a community's water supply by analyzing one or more samples of it. If the psychologist wishes to test the extent of a child's vocabulary, a clerk's ability to perform arithmetic computations, or a pilot's eye-hand coordination, he examines their performance with a representative set of words, arithmetic problems, or motor tests.

Whether or not the test adequately covers the behavior under consideration obviously depends on the number and nature of the items in the sample. A vocabulary test composed entirely of baseball terms would hardly provide a dependable estimate of a child's total range of vocabulary. Similarly, an arithmetic test consisting of only five problems, or one including only multiplication items, would be a poor measure of the individual's computational skill.

The diagnostic or predictive value of a psychological test depends on the degree to which it serves as an indicator of a relatively broad and significant area of behavior. Measurement of the behavior sample directly covered by the test is rarely, if ever, the goal of psychological testing. The child's knowledge of a particular list of 50 words is not, in itself, of great interest. Nor is the job applicant's performance on a specific set of 20 arithmetic problems of much importance. If, however, it can be demonstrated that there is a close correspondence between the child's knowledge of the word list and his total mastery of vocabulary, or between the applicant's score on the arithmetic problems and his computational performance on the job, then the tests are serving their purpose.

It should be noted in this connection that the test items need not resemble closely the behavior the test is to predict. It is only necessary that an empirical correspondence be demonstrated between the two. The degree of similarity between the test sample and the predicted behavior may vary widely. At one extreme, the test may coincide completely with a part of the behavior to be predicted. An example might be a foreign vocabulary test in which the students are examined on 20 of the 50 new words they have studied; another example is provided by the road test taken prior to obtaining a driver's license. A lesser degree of similarity is illustrated by many vocational aptitude tests administered prior to job training, in which there is only a moderate resemblance between the tasks performed on the job and those incorporated in the test. At the other extreme one finds projective personality tests such as the Rorschach inkblot test, in which an attempt is made to predict from the subject's associations to inkblots how he will react to other people, to emotionally toned stimuli, and to other complex, everyday-life situations. Despite their superficial differences, all these tests consist of samples of the individual's behavior, and each must prove its worth by an empirically demonstrated correspondence between the subject's performance on the test and in other situations.

Whether the term "diagnosis" or the term "prediction" is employed in this connection represents a minor distinction. Prediction commonly connotes a temporal estimate, the individual's future performance on a job, for example, being forecast from his present test performance. But in a broader sense, even the diagnosis of present condition, such as mental retardation or emotional disorder, implies a prediction of what the individual will do in situations other than the present test. It is logically simpler to consider all tests as behavior samples from which predictions regarding other behavior can be made. Different types of tests can then be characterized as variants of this basic pattern.

Another point that should be considered at the outset pertains to the concept of capacity. It is entirely possible, for example, to devise a test for predicting how well an individual can learn French before he has even begun the study of French. Such a test would involve a sample of the types of behavior required to learn the new language, but would in itself presuppose no knowledge of French. It could then be said that this test measures the individual's "capacity" or "potentiality" for learning French. Such terms should, however, be used with caution in reference to psychological tests. Only in the sense that a present behavior sample can be used as an indicator of other, future behavior can we speak of a test measuring "capacity." No psychological test can do more than measure behavior. Whether such behavior can serve as an effective index of other behavior can be determined only by empirical try-out.

STANDARDIZATION. It will be recalled that in the initial definition a psychological test was described as a standardized measure. Standardization implies uniformity of procedure in administering and scoring the test. If the scores obtained by different individuals are to be comparable, testing conditions must obviously be the same for all. Such a requirement is only a special application of the need for controlled conditions in all scientific observations. In a test situation, the single independent variable is usually the individual being tested.

In order to secure uniformity of testing conditions, the test constructor provides detailed directions for administering each newly developed test. The formulation of such directions is a major part of the standardization of a new test. Such standardization extends to the exact materials employed, time limits, oral instructions to subjects, preliminary demonstrations, ways of handling queries from subjects, and every other detail of the testing situation. Many other, more subtle factors may influence the subject's performance on certain tests. Thus, in giving instructions or presenting problems orally, consideration must be given to the rate of speaking, tone of voice, inflection, pauses, and facial expression. In a test involving the detection of absurdities, for example, the correct answer may be given away by smiling or pausing when the crucial word is read. Standardized testing procedure, from the examiner's point of view, will be discussed further in a later section of this chapter dealing with problems of test administration.

Another important step in the standardization of a test is the establishment of norms. Psychological tests have no predetermined standards of passing or failing; an individual's score is evaluated by comparing it with the scores obtained by others. As its name implies, a norm is the normal or average performance. Thus, if normal 8-year-old children complete 12 out of 50 problems correctly on a particular arithmetic reasoning test, then the 8-year-old norm on this test corresponds to a score of 12. The latter is known as the raw score on the test. It may be expressed as number of correct items, time required to complete a task, number of errors, or some other objective measure appropriate to the content of the test. Such a raw score is meaningless until evaluated in terms of a suitable set of norms.

In the process of standardizing a test, it is administered to a large, representative sample of the type of subjects for whom it is designed. This group, known as the standardization sample, serves to establish the norms. Such norms indicate not only the average performance but also the relative frequency of varying degrees of deviation above and below the average. It is thus possible to evaluate different degrees of superiority and inferiority. The specific ways in which such norms may be expressed will be considered in Chapter 4. All permit the designation of the individual's position with reference to the normative or standardization sample.

It might also be noted that norms are established for personality tests in essentially the same way as for aptitude tests. The norm on a personality test is not necessarily the most desirable or "ideal" performance, any more than a perfect or errorless score is the norm on an aptitude test. On both types of tests, the norm corresponds to the performance of typical or average individuals. On dominance-submission tests, for example, the norm falls at an intermediate point representing the degree of dominance or submission manifested by the average individual. Similarly, in an emotional adjustment inventory, the norm does not ordinarily correspond to a complete absence of unfavorable or maladaptive responses, since a few such responses occur in the majority of "normal" individuals in the standardization sample. It is thus apparent that psychological tests, of whatever type, are based on empirically established norms.
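In present-day computational terms, evaluating a raw score against norms amounts to locating that score within the distribution of the standardization sample. The following is a minimal Python sketch; the ten scores and the raw score of 12 are invented for the example, whereas an actual standardization sample would be far larger and chosen to be representative.

    # Minimal sketch: evaluating a raw score against a standardization sample.
    # All numbers are invented for illustration.

    def percentile_rank(raw_score, sample_scores):
        """Percentage of the standardization sample falling below raw_score,
        counting half of any ties (one common convention for percentile ranks)."""
        below = sum(1 for s in sample_scores if s < raw_score)
        ties = sum(1 for s in sample_scores if s == raw_score)
        return 100.0 * (below + 0.5 * ties) / len(sample_scores)

    # Hypothetical arithmetic-reasoning raw scores of 8-year-olds (50 items)
    standardization_sample = [8, 10, 11, 12, 12, 13, 13, 14, 15, 16]

    norm = sum(standardization_sample) / len(standardization_sample)
    print(f"8-year-old norm (average raw score): {norm:.1f}")
    print(f"Percentile rank of a raw score of 12: "
          f"{percentile_rank(12, standardization_sample):.0f}")

A raw score at the norm thus falls near the 50th percentile, while scores above or below it can be graded by how far they deviate from the average.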
OBJECTIVE MEASUREMENT OF DIFFICULTY. Reference to the definition of a psychological test with which this discussion opened will show that such a test was characterized as an objective as well as a standardized measure. In what ways are such tests objective? Some aspects of the objectivity of psychological tests have already been touched on in the discussion of standardization. Thus, the administration, scoring, and interpretation of scores are objective insofar as they are independent of the subjective judgment of the individual examiner. Any one individual should theoretically obtain the identical score on a test regardless of who happens to be his examiner. This is not entirely so, of course, since perfect standardization and objectivity have not been attained in practice. But at least such objectivity is the goal of test construction, and it has been achieved to a reasonably high degree in most tests.

There are other major ways in which psychological tests can be properly described as objective. The determination of the difficulty level of an item or of a whole test is based on objective, empirical procedures. When Binet and Simon prepared their original 1905 scale for the measurement of intelligence, they arranged the 30 items of the scale in order of increasing difficulty. Such difficulty, it will be recalled, was determined by trying out the items on 50 normal and a few mentally retarded children. The items correctly solved by the largest number of children were, ipso facto, taken to be the easiest; those passed by relatively few children were regarded as more difficult items. By this procedure, an empirical order of difficulty was established. This early example typifies the objective measurement of difficulty level, which is now common practice in psychological test construction.

Not only the arrangement but also the selection of items for inclusion in a test can be determined by the proportion of subjects in the trial samples who pass each item. Thus, if there is a bunching of items at the easy or difficult end of the scale, some items can be discarded; similarly, if items are sparse in certain portions of the difficulty range, new items can be added to fill the gaps. More technical aspects of item analysis will be considered in Chapter 8.
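The Binet-Simon procedure described above reduces, computationally, to finding the proportion of a tryout sample that passes each item and ordering the items by that proportion. A minimal sketch, with invented pass/fail records:

    # Minimal sketch of empirical difficulty ordering: the proportion of a
    # tryout sample passing each item (its "p value") defines its difficulty,
    # and items are arranged from easiest to hardest. Data are invented.

    tryout_results = {            # item -> pass (1) or fail (0) for each child
        "item_A": [1, 1, 1, 1, 0, 1, 1, 1],
        "item_B": [1, 0, 1, 0, 0, 1, 0, 0],
        "item_C": [1, 1, 0, 1, 0, 1, 1, 0],
    }

    p_values = {item: sum(r) / len(r) for item, r in tryout_results.items()}

    # A higher proportion passing means an easier item
    for item, p in sorted(p_values.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{item}: proportion passing = {p:.2f}")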

RELIABILITY. How good is this test? Does it really work? These questions could, and occasionally do, result in long hours of futile discussion. Subjective opinions, hunches, and personal biases may lead, on the one hand, to extravagant claims regarding what a particular test can accomplish and, on the other, to stubborn rejection. The only way questions such as these can be conclusively answered is by empirical trial. The objective evaluation of psychological tests involves primarily the determination of the reliability and the validity of the test in specified situations.

As used in psychometrics, the term "reliability" always means consistency. Test reliability is the consistency of scores obtained by the same persons when retested with the identical test or with an equivalent form of the test. If a child receives an IQ of 110 on Monday and an IQ of 80 when retested on Friday, it is obvious that little or no confidence can be put in either score. Similarly, if in one set of 50 words an individual identifies 40 correctly, whereas in another, supposedly equivalent set he gets a score of only 20 right, then neither score can be taken as a dependable index of his verbal comprehension. To be sure, in both illustrations it is possible that only one of the two scores is in error, but this could be demonstrated only by further retests. From the given data, we can conclude only that both scores cannot be right. Whether one or neither is an adequate estimate of the individual's ability in vocabulary cannot be established without additional information.

Before a psychological test is released for general use, a thorough, objective check of its reliability should be carried out. Reliability can be checked with reference to temporal fluctuations, the particular selection of items or behavior sample constituting the test, the role of different examiners or scorers, and other aspects of the testing situation. It is essential to specify the type of reliability and the method employed to determine it, because the same test may vary in these different aspects. The number and nature of individuals on whom reliability was checked should likewise be reported. With such information, the test user can predict whether the test will be about equally reliable for the group with which he expects to use it, or whether it is likely to be more reliable or less reliable. The different types of test reliability, as well as methods of measuring each, will be considered in Chapter 5.
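Although the statistical treatment of reliability is deferred to Chapter 5, the consistency described here is conventionally summarized as a correlation coefficient between the paired scores of the same persons on two administrations. A minimal sketch, with invented retest data (the standard-library correlation function requires Python 3.10 or later):

    # Minimal sketch: test-retest reliability as the Pearson correlation
    # between two administrations of the same test. Scores are invented.
    from statistics import correlation  # available in Python 3.10+

    first_testing  = [110, 95, 102, 88, 120, 105, 99, 115]
    second_testing = [108, 97, 100, 90, 118, 103, 101, 112]

    r_tt = correlation(first_testing, second_testing)
    print(f"Test-retest reliability coefficient: {r_tt:.2f}")  # near 1.0 = consistent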
VALIDITY. Undoubtedly the most important question to be asked about any psychological test concerns its validity, i.e., the degree to which the test actually measures what it purports to measure. Validity provides a direct check on how well the test fulfills its function. The determination of validity usually requires independent, external criteria of whatever the test is designed to measure. For example, if a medical aptitude test is to be used in selecting promising applicants for medical school, ultimate success in medical school would be a criterion. In the process of validating such a test, it would be administered to a large group of students at the time of their admission to medical school. Some measure of performance in medical school would eventually be obtained for each student on the basis of grades, ratings by instructors, success or failure in completing training, and the like. Such a composite measure constitutes the criterion with which each student's initial test score is to be correlated. A high correlation, or validity coefficient, would signify that those individuals who scored high on the test had been relatively successful in medical school, whereas those scoring low on the test had done poorly in medical school. A low correlation would indicate little correspondence between test score and criterion measure and hence poor validity for the test. The validity coefficient enables us to determine how closely the criterion performance could have been predicted from the test scores.

In a similar manner, tests designed for other purposes can be validated against appropriate criteria. A vocational aptitude test, for example, can be validated against on-the-job success of a trial group of new employees. A pilot aptitude battery can be validated against achievement in flight training. Tests designed for broader and more varied uses are validated against a number of criteria, and their validity can be established only by the gradual accumulation of data from many different kinds of investigations.

The reader may have noticed an apparent paradox in the concept of test validity. If it is necessary to follow up the subjects or in other ways to obtain independent measures of what the test is trying to predict, why not dispense with the test? The answer to this riddle is to be found in the distinction between the validation group, on the one hand, and the groups on which the test will eventually be employed for operational purposes, on the other. Before the test is ready for use, its validity must be established on a representative sample of subjects. The scores of these persons are not themselves employed for operational purposes but serve only in the process of testing the test. If the test proves valid by this method, it can then be used on other samples in the absence of criterion measures.

It might still be argued that we would need only to wait for the criterion measure to mature, to become available, on any group in order to obtain the information that the test is trying to predict. But such a procedure would be so wasteful of time and energy as to be prohibitive in most instances. Thus, we could determine which applicants will succeed on a job or which students will satisfactorily complete college by admitting all who apply and waiting for subsequent developments! It is the very wastefulness of this procedure, and its deleterious emotional impact on individuals, that tests are designed to minimize. By means of tests, the person's present level of prerequisite skills, knowledge, and other relevant characteristics can be assessed with a determinable margin of error. The more valid and reliable the test, the smaller will be this margin of error.

The special problems encountered in determining the validity of different types of tests, as well as the specific criteria and statistical procedures employed, will be discussed in Chapters 6 and 7. One further point, however, should be considered at this time. Validity tells us more than the degree to which the test is fulfilling its function; it actually tells us what the test is measuring. By studying the validation data, we can objectively determine what the test is measuring. It would thus be more accurate to define validity as the extent to which we know what the test measures.
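In computational terms, the validity coefficient described earlier in this section is likewise a correlation, computed between test scores and the criterion measure obtained later; from it one can also derive a conventional expression of the "determinable margin of error" in prediction, the standard error of estimate. A minimal sketch with invented admission scores and criterion grades:

    # Minimal sketch: a validity coefficient as the correlation between test
    # scores at admission and a later criterion (e.g., grade average). The
    # data are invented; a real validation study would use a large sample.
    from statistics import correlation, stdev  # correlation needs Python 3.10+

    test_scores = [52, 47, 61, 38, 55, 44, 66, 50]          # at admission
    criterion   = [3.1, 2.8, 3.6, 2.4, 3.0, 2.7, 3.8, 3.2]  # later performance

    r_xy = correlation(test_scores, criterion)
    print(f"Validity coefficient: {r_xy:.2f}")

    # Standard error of estimate: the margin of error remaining when the
    # criterion is predicted from the test by a linear regression
    see = stdev(criterion) * (1 - r_xy**2) ** 0.5
    print(f"Standard error of estimate: {see:.2f} criterion units")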

The interpretation of test scores would undoubtedly be clearer and less ambiguous if tests were regularly named in terms of the criterion through which they had been validated. A tendency in this direction can be recognized in such test labels as "scholastic aptitude test" and "personnel classification test" in place of the vague title "intelligence test."

REASONS FOR CONTROLLING THE USE OF PSYCHOLOGICAL TESTS

"Can I have a Stanford-Binet blank? My nephew has to take it next week for admission to School X and I'd like to give him some practice so he can pass."

"To improve the reading program in our school, we need a culture-free IQ test that measures each child's innate potential."

"Last year you gave a new personality test to our employees for research purposes. We would now like to have the scores for their personnel folders."

"My roommate is studying psych. She gave me a personality test and I came out neurotic. I've been too upset to go to class ever since."

"Last night I answered the questions in an intelligence test published in a magazine and I got an IQ of 80. I think psychological tests are silly."

The above remarks are not imaginary. Each is based on a real incident, and the list could easily be extended by any psychologist. Such remarks illustrate potential misuses or misinterpretations of psychological tests, in ways that may render the tests worthless or hurt the individual. Like any scientific instrument or precision tool, psychological tests must be properly used to be effective. In the hands of either the unscrupulous or the well-meaning but uninformed user, such tests can cause serious damage. There are two principal reasons for controlling the use of psychological tests: (a) to prevent general familiarity with test content, which would invalidate the test, and (b) to ensure that the test is used by a qualified examiner.

Test content clearly has to be restricted in order to forestall deliberate efforts to fake scores. If an individual were to memorize the correct responses on a test of color blindness, such a test would no longer be a measure of color vision for him; under such conditions, the test would be completely invalidated. In other cases, the effect of familiarity may be less obvious, or the test may be invalidated in good faith by misinformed persons. A schoolteacher, for example, may give her class special practice in problems closely resembling those on an intelligence test, "so that the pupils will be well prepared to take the test." Such an attitude is simply a carry-over from the usual procedure of preparing for a school examination. When applied to an intelligence test, however, it is likely that such specific training or coaching will raise the scores on the test without appreciably affecting the broader area of behavior the test tries to sample. Under these conditions, the validity of the test as a predictive instrument is reduced.

The need for a qualified examiner is evident in each of the three major aspects of the testing situation: selection of the test, administration and scoring, and interpretation of scores. Tests cannot be chosen like lawn mowers, from a mail-order catalogue. They cannot be evaluated by name, author, or other easy marks of identification. To be sure, it requires no psychological training to consider such factors as cost, bulkiness and ease of transporting test materials, testing time required, and ease and rapidity of scoring. Information on these practical points can usually be obtained from a test catalogue and should be taken into account in planning a testing program. For the test to serve its function, however, an evaluation of its technical merits in terms of such characteristics as validity, reliability, difficulty level, and norms is essential. Only in such a way can the test user determine the appropriateness of any test for his particular purpose and its suitability for the type of persons with whom he plans to use it.

The introductory discussion of test standardization earlier in this chapter has already suggested the importance of a trained examiner. An adequate realization of the need to follow instructions precisely, as well as a thorough familiarity with the standard instructions, is required if the test scores obtained by different examiners are to be comparable or if any one individual's score is to be evaluated in terms of the published norms. Careful control of testing conditions is also essential. Similarly, incorrect or inaccurate scoring may render the test score worthless. In the absence of proper checking procedures, scoring errors are far more likely to occur than is generally realized.

The proper interpretation of test scores requires a thorough understanding of the test, the individual, and the testing conditions. What is being measured can be objectively determined only by reference to the specific procedures in terms of which the particular test was validated. Other information, pertaining to reliability, the nature of the group on which norms were established, and the like, is likewise relevant. Some background data regarding the individual being tested are essential in interpreting any test score. The same score may be obtained by different persons for very different reasons; the conclusions to be drawn from such scores would therefore be quite dissimilar. Finally, some consideration must also be given to special factors that may have influenced a particular score, such as unusual testing conditions, the temporary emotional or physical state of the subject, and the extent of the subject's previous experience with tests.

When apparatus is employed. There is also evidence to show that the Slli9ir~loyed may affect test scores (Bell. proved to be significant in a group testing project with high school students. Hoff. the equivalence of these answer sheet# cannot be assumed. and arranged in advance of the testing day.. The preparation of test materials is an9ther important preliminary step. Special efforts must therefore be made to foresee and forestall emergencies.Nature alld (he of PsycllOlogiclIl Tc'sls 33 · J The basic rationale of testing im·olves generalization from the behavior sample observed in the testing situation to beha"ior manifested in other. In individual testing and especially in the administration of performance tests. But such a survey falls outside the scope of the present book.. 1~43:Traxler & Hilkert.'11 machine-scorable answer sheets. In the absence of empirical verification. special pencils. Materials should generally be placed on a table near the testing ta. and is in charge of the group in anyone testing room. venti~ .manner during test admillish'ation. for example. Standardized procedure applies not only to verbal instructions. ADVASCE PREPARATIOS OF E. how the student will achieve in college courses. checked. answer individual questions of subjects within the limitations specified in the manual. and how the applicant will perform on the job.-19t3~1~li'~1lfr-~~ab1ishment of independent test-scoring and data-processing agencies that.edure is advanc-e preparation. Sattler (1974). A test SCOl'e should help us to predict how the client will feel and act outside the clinic. Thorough familiarity with the specific testing procedure is another important prerequisite in both individual and group testing. such training may requi. For individual testing. all test blanks. & Hoyt. such preparation invqlves the actual layout of the necessary materials to facilitate subsequent use with a minimum of search or fumbling. Even apparentl~' ·minor aspects of the testing situation may appreciably alter performance. some· previous familiarity with the statements to be read prevents misreading and hesitation and permits a more natural. 'Advance preparation for the testing session takes many forms. The most important requirement for good testing proc. Even ill a group test in which the instructions are reauto the subrects. -. provide their 0\1. Depending upon the nature of the test and the type of subjects to be examined.. A whole volume could easil\' be devoted to a discussion of desirable procedures of test administration. In general. informal . materials. such preparation may include advance briefing of examiners and proctors. the examiner reads the instructions. It is therefore important to identify any test-related influences that may limit or impair the generalizability of test results.re from a few demonstration and practice sessions to over a year of instruction. make certain that subjects are following instructions. uch techniques within specific settings.\wed. the groups using desks tending to obtain higher scores (Kelley. examiners sometimes administer group tests with answer sheets other than those lIsed in the standardization sample. answer sheets. For group testing.· or other materials TESTING COXDlTlOXS. and espeCially in large-scale projects. nontest situations. and other aspects of the tests themselves but also to the testing environment. frequent periodic checking and calibration may be necessary. For detailed suggestions regarding testing procedure. 
from the examination of infants to the clinical testing of psychotic patients or the administration of a mass testing program for military personnel. Such a factor as the use of deSKSor of chairs with desk arms. and prevent cheating. and Terman and Merrill (1960) for individual testing. Memorizingthe exact verbal instructions is essential in most individual testing."I:AMINERS. 1942).a{ls. It is important to realize the extent to which testing conditions may lI1fluence scores. Any influences that are specific to the test situation constitute error variance and reduce test validity. In testing there can he no emergencies. see Palmer (1970).. Only in this way can unifom1ity of procedure be . may be administered with any of five different answer . The present discussion will therefore deal principally with the common rationale of test administration rather than with specific questions of implementation. supervised training in the administration of the particular test is usually essential. timing. for example. and Clemans (1971) for group testing. it is more pra~ticable to acquire ~. The Differential Aptitude Tests. takes care of timing. provided all personnel have learned that such a sign means no admittance under any circumstances. In group testing. because no one person would s normally be concerned with all forms of testing. locking the doors or posting an assistant outside each door may be neeessarv to-prevent the entrance of late-comers. In the testing of large groups. This room should be hould wvide . Some attention should be iven to the selection of a . The proctors hand out and collect test materials. so that each is hilly infonned about the functions he is to perform.~cial~ should a so e ta -en to prevcnt mtcrrup ons unng the test. needed should be carefully counted. ~ flijJ.~le so that they are within easy reach of the examiner but do not distriCt Vte subject. Moreover. Posting a sign on the door to indicate that testing is in progress is effective.

Many other, more subtle testing conditions have been shown to affect performance on ability as well as personality tests. Whether the examiner is a stranger or someone familiar to the subjects may make a significant difference in test scores (Sacks, 1952; Tsudzuki, Hata, & Kuze, 1956). In another study, the general manner and behavior of the examiner, as illustrated by smiling, nodding, and making such comments as "good" or "fine," were shown to have a decided effect on test results (Wickes, 1956). In a projective test requiring the subject to write stories to fit given pictures, the presence of the examiner in the room tended to inhibit the inclusion of strongly emotional content in the stories (Bernstein, 1956). In the administration of a typing test, job applicants typed at a significantly faster rate when tested alone than when tested in groups of two or more (Kirchner, 1966).

Examples could readily be multiplied, and the implications are threefold. First, follow standardized procedures to the minutest detail; it is the responsibility of the test author and publisher to describe such procedures fully and clearly in the test manual. Second, record any unusual testing conditions, however minor. Third, take testing conditions into account when interpreting test results. In the intensive assessment of a person through individual testing, an experienced examiner may occasionally depart from the standardized test procedure in order to elicit additional information for special reasons. When he does so, he can no longer interpret the subject's responses in terms of the test norms. Under these circumstances, the test stimuli are used only for qualitative exploration, and the responses should be treated in the same way as any other informal behavioral observations or interview data.

RAPPORT. In psychometrics, the term "rapport" refers to the examiner's efforts to arouse the subject's interest in the test, elicit his cooperation, and ensure that he follows the standard test instructions. In ability tests, the instructions call for careful concentration on the given tasks and for putting forth one's best efforts to perform well; in personality inventories, they call for frank and honest responses to questions about one's usual behavior; in certain projective tests, they call for full reporting of associations evoked by the stimuli, without any censoring or editing of content. Still other kinds of tests may require other approaches. But in all instances, the examiner endeavors to motivate the subject to follow the instructions as fully and conscientiously as he can.

The training of examiners covers techniques for the establishment of rapport as well as those more directly related to test administration. In establishing rapport, as in other testing procedures, uniformity of conditions is essential for comparability of results. If a child is given a coveted prize whenever he solves a test problem correctly, his performance cannot be directly compared with the norms or with that of other children who are motivated only with the standard verbal encouragement or praise. Any deviation from standard motivating conditions for a particular test should be noted and taken into account in interpreting performance.

Although rapport can be more fully established in individual testing, steps can also be taken in group testing to motivate the subjects and relieve their anxiety. Specific techniques for establishing rapport vary with the nature of the test and with the age and other characteristics of the subjects. In testing preschool children, special factors to be considered include shyness with strangers, distractibility, and negativism. A friendly, cheerful, and relaxed manner on the part of the examiner helps to reassure the child. The shy, timid child needs more preliminary time to become familiar with his surroundings. For this reason it is better for the examiner not to be too demonstrative at the outset, but rather to wait until the child is ready to make the first contact. Test periods should be brief, and the tasks should be varied and intrinsically interesting to the child. The testing should be presented to the child as a game and his curiosity aroused before each new task is introduced. A certain flexibility of procedure is necessary at this age level because of possible refusals, loss of interest, and other manifestations of negativism.

Children in the first two or three grades of elementary school present many of the same testing problems as the preschool child. The game approach is still the most effective way of arousing their interest in the test. The older schoolchild can usually be motivated through an appeal to his competitive spirit and his desire to do well on tests. When testing children from educationally disadvantaged backgrounds or from different cultures, however, the examiner cannot assume they will be motivated to excel on academic tasks to the same extent as children in the standardization sample. This problem and others pertaining to the testing of persons with dissimilar experiential backgrounds will be considered further in Chapters 3, 7, and 12.

Special motivational problems may be encountered in testing emotionally disturbed persons, prisoners, or juvenile delinquents. Especially when examined in an institutional setting, such persons are likely to manifest a number of unfavorable attitudes, such as suspicion, insecurity, fear, or cynical indifference.

Abnormal conditions in their past experiences are also likely to influence their test performance adversely. As a result of early failures and frustrations in school, for example, they may have developed feelings of hostility and inferiority toward academic tasks, which the tests resemble. The experienced examiner makes special efforts to establish rapport under these conditions. In any event, he must be sensitive to these special difficulties and take them into account in interpreting and explaining test performance.

In testing any school-age child or adult, one should bear in mind that every test presents an implied threat to the individual's prestige. Some reassurance should therefore be given at the outset. It is helpful to explain, for example, that no one is expected to finish or to get all the items correct. The individual might otherwise experience a mounting sense of failure as he advances to the more difficult items or finds that he is unable to finish any subtest within the time allowed.

It is also desirable to eliminate the element of surprise from the test situation as far as possible, because the unexpected and unknown are likely to produce anxiety. Many group tests provide a preliminary explanatory statement that is read to the group by the examiner. An even better procedure is to announce the tests a few days in advance and to give each subject a printed booklet that explains the purpose and nature of the tests, offers general suggestions on how to take tests, and contains a few sample items. Such explanatory booklets are regularly available to participants in large-scale testing programs such as those conducted by the College Entrance Examination Board (1974a, 1974b).

Adult testing presents some additional problems. Unlike the schoolchild, the adult is not so likely to work hard at a task merely because it is assigned to him. It therefore becomes more important to "sell" the purpose of the tests to the adult, although high school and college students also respond to such an appeal. Cooperation of the examinee can usually be secured by convincing him that it is in his own interests to obtain a valid score, i.e., a score correctly indicating what he can do rather than overestimating or underestimating his abilities. Most persons will understand that an incorrect decision, which might result from invalid test scores, would mean subsequent failure, loss of time, and frustration for them. It is certainly not in the best interests of the individual to be admitted to a course of study for which he is not qualified, or assigned to a job he cannot perform or that he would find uncongenial. This approach can serve not only to motivate the individual to try his best on ability tests but also to reduce faking and encourage frank reporting on personality inventories, because the examinee realizes that he himself would otherwise be the loser.

The United States Employment Service has developed a booklet on how to take tests, as well as a more extensive pretesting orientation technique for use with culturally disadvantaged applicants unfamiliar with tests. A tape recording and two booklets are combined in the Test Orientation Procedure (TOP), designed specifically for job applicants with little prior testing experience (Bennett & Doppelt, 1967). The first booklet, used together with the tape, provides general information on how to take tests; the second contains practice tests. In the absence of a tape recorder, the examiner may read the instructions from a printed script. More general orientation booklets are also available, such as Meeting the Test (Anderson, Katz, & Shimberg, 1965).

TEST ANXIETY. Many of the practices designed to enhance rapport serve also to reduce test anxiety. Procedures tending to dispel surprise and strangeness from the testing situation and to reassure and encourage the subject should certainly help to lower anxiety. The examiner's own manner and a well-organized, smoothly running testing operation will contribute toward the same goal. Individual differences in test anxiety have been studied with both schoolchildren and college students (Gaudry & Spielberger, 1974; Spielberger, 1972). Much of this research was initiated by Sarason and his associates at Yale (Sarason, Davidson, Lighthall, Waite, & Ruebush, 1960). The first step was to construct a questionnaire to assess the individual's test-taking attitudes. The children's form, for example, contains items such as the following:

Do you worry a lot before taking a test?
When the teacher says she is going to find out how much you have learned, does your heart begin to beat faster?
While you are taking a test, do you usually think you are not doing well?

Of primary interest is the finding that both school achievement and intelligence test scores yielded significant negative correlations with test anxiety. Similar correlations have been found among college students (I. G. Sarason, 1961). Longitudinal studies likewise revealed an inverse relation between changes in anxiety level and changes in intelligence or achievement test performance (Hill & Sarason, 1966; Sarason, Hill, & Zimbardo, 1964). Such findings, however, do not indicate the direction of causal relationships. It is possible that children develop test anxiety because they perform poorly on tests and have thus experienced failure and frustration in previous test situations.

In support of this interpretation is the finding that within subgroups of high scorers on intelligence tests, the negative correlation between anxiety level and test performance disappears (Denny, 1966; Feldhusen & Klausmeier, 1962). On the other hand, there is evidence suggesting that at least some of the relationship results from the deleterious effects of anxiety on test performance. In one study (Waite, Sarason, Lighthall, & Davidson, 1958), high-anxious and low-anxious schoolchildren equated in intelligence test scores were given repeated trials in a learning task. Although initially equal in the learning test, the low-anxious group improved significantly more than the high-anxious. It is undoubtedly true, moreover, that a chronically high anxiety level will exert a detrimental effect on school learning and intellectual development.

Several investigators have compared test performance under conditions designed to evoke "anxious" and "relaxed" states. Mandler and Sarason (1952), for example, found that ego-involving instructions, such as telling subjects that everyone is expected to finish in the time allotted, had a beneficial effect on the performance of low-anxious subjects but a deleterious effect on that of high-anxious subjects. Other studies have likewise found an interaction between testing conditions and such individual characteristics as anxiety level and achievement motivation (Lawrence, 1962; Paul & Eriksen, 1964). It thus appears likely that the relation between anxiety and test performance is nonlinear, a slight amount of anxiety being beneficial while a large amount is detrimental. Individuals who are customarily low-anxious benefit from test conditions that arouse some anxiety, while those who are customarily high-anxious perform better under more relaxed conditions.

To what extent does test anxiety make the individual's test performance unrepresentative of his customary performance level in nontest situations? Because of the competitive pressure experienced by college-bound high school seniors in America today, it has been argued that performance on college admission tests may be unduly affected by test anxiety. In a thorough and well-controlled investigation of this question, French (1962) compared the performance of high school students on a test given as part of the regular administration of the SAT with performance on a parallel form of the test administered at a different time under "relaxed" conditions. The instructions on the latter occasion specified that the test was given for research purposes only and that scores would not be sent to any college. The results showed that performance was no poorer during the standard administration than during the relaxed administration. Moreover, the concurrent validity of the test scores against high school course grades did not differ significantly under the two conditions.

EXAMINER AND SITUATIONAL VARIABLES. Comprehensive surveys of the effects of examiner and situational variables on test scores have been prepared by S. B. Sarason (1954), Masling (1960), Moriarty (1961, 1966), Palmer (1970), Sattler (1970, 1974), and Sattler and Theye (1967). Although some effects have been demonstrated with objective group tests, most of the data have been obtained with either projective techniques or individual intelligence tests. These extraneous factors are more likely to operate with unstructured and ambiguous stimuli, as well as with difficult and novel tasks, than with clearly defined and well-learned functions. In general, children are more susceptible to examiner and situational influences than are adults; in the examination of preschool children, the role of the examiner is especially crucial. Emotionally disturbed and insecure persons of any age are also more likely to be affected by such conditions than are well-adjusted persons.

There is considerable evidence that test results may vary systematically as a function of the examiner (E. Cohen, 1965; Masling, 1960). These differences may be related to personal characteristics of the examiner, such as his age, sex, race, professional or socioeconomic status, training and experience, personality characteristics, and appearance. Several studies of these examiner variables, however, have yielded misleading or inconclusive results because the experimental designs failed to control or isolate the influence of different examiner or subject characteristics; hence the effects of two or more variables may be confounded. Moreover, there may be significant interactions between examiner and examinee characteristics, in the sense that the same examiner characteristic or testing manner may have a different effect on different examinees as a function of the examinee's own personality characteristics. Similar interactions may occur with task variables, such as the nature of the test, the purpose of the testing, and the instructions given to the subjects. Dyer (1973) adds even more variables to this list, calling attention to the possible influence of the test givers' and the test takers' diverse perceptions of the functions and goals of testing.

The examiner's behavior before and during test administration has also been shown to affect test results. Controlled investigations have yielded significant differences in intelligence test performance as a result of a "warm" versus a "cold" interpersonal relation between examiner and examinees, or a rigid and aloof versus a natural manner on the part of the examiner (Exner, 1966; Masling, 1959). Still another way in which an examiner may inadvertently affect the examinee's responses is through his own expectations. This is simply a special instance of the self-fulfilling prophecy (Rosenthal, 1966; Rosenthal & Rosnow, 1969).

An experiment conducted with the Rorschach will illustrate this effect (Masling, 1965). The examiners were 14 graduate student volunteers, 7 of whom were told, among other things, that experienced examiners elicit more human than animal responses from the subjects, while the other 7 were told that experienced examiners elicit more animal than human responses. Under these conditions, the two groups of examiners obtained significantly different ratios of animal to human responses from their subjects. These differences occurred despite the fact that tape recordings of all testing sessions revealed no evidence of verbal influence on the part of any examiner. The examiners' expectations apparently operated through subtle postural and facial cues to which the subjects responded.

Apart from the examiner, other aspects of the testing situation may significantly affect test performance. Military recruits, for example, are often examined shortly after induction, during a period of intense readjustment to an unfamiliar and stressful situation. In one investigation designed to test the effect of acclimatization to such a situation on test performance, 2,724 recruits were given the Navy Classification Battery during their ninth day at the Naval Training Center (Gordon & Alf, 1960). When their scores were compared with those obtained by 2,180 recruits tested at the conventional time, during their third day, the 9-day group scored significantly higher on all subtests of the battery.

The examinees' activities immediately preceding the test may also affect their performance, especially when such activities produce emotional disturbance, fatigue, or other handicapping conditions. In an investigation with third- and fourth-grade schoolchildren, there was some evidence to suggest that IQ on the Draw-a-Man Test was influenced by the children's preceding classroom activity (McCarthy, 1944). On one occasion, the class had been engaged in writing a composition on "The Best Thing That Ever Happened to Me"; on the second occasion, they had again been writing, but this time on "The Worst Thing That Ever Happened to Me." The IQ's on the second test, following what may have been an emotionally depressing experience, averaged 4 or 5 points lower than on the first test. These findings were corroborated in a later investigation specifically designed to determine the effect of immediately preceding experience on the Draw-a-Man Test (Reichenberg-Hackett, 1953). In this study, children who had had a gratifying experience involving the successful solution of an interesting puzzle, followed by a reward of toys and candy, showed more improvement in their test scores than those who had undergone neutral or less gratifying experiences. Similar results were obtained by W. Davis (1969a, 1969b) with college students. Performance on an arithmetic reasoning test was significantly poorer when preceded by a failure experience on a verbal comprehension test than it was in a control group given no preceding test and in one that had taken a standard verbal comprehension test under ordinary conditions.

Several studies have been concerned with the effects of feedback regarding test scores on the individual's subsequent test performance. In a particularly well-designed investigation with seventh-grade students, Bridgeman (1974) found that "success" feedback was followed by significantly higher performance on a similar test than was "failure" feedback in subjects who had actually performed equally well to begin with. This type of motivational feedback may operate largely through the goals the subjects set for themselves in subsequent performance and may thus represent another example of the self-fulfilling prophecy. Such general motivational feedback should not be confused with corrective feedback, whereby the individual is informed about the specific items he missed and given remedial instruction. Under the latter conditions, feedback is much more likely to improve the performance of initially low-scoring persons.

The examples cited in this section illustrate the wide diversity of test-related factors that may affect test scores. In the majority of well-administered testing programs, the influence of these factors is negligible for practical purposes. Nevertheless, the skilled examiner is constantly on guard to detect the possible operation of such factors and to minimize their influence. When circumstances do not permit the control of these conditions, the conclusions drawn from test performance should be qualified.

In evaluating the effect of coaching or practice on test scores, a fundamental question is whether the improvement is limited to the specific items included in the test or whether it extends to the broader area of behavior that the test is designed to predict. The answer to this question represents the difference between coaching and education. Obviously any educational experience the individual undergoes, either formal or informal, in or out of school, should be reflected in his performance on tests sampling the relevant aspects of behavior. Such broad influences will in no way invalidate the test, since the test score presents an accurate picture of the individual's standing in the abilities under consideration. The difference is, of course, one of degree. Influences cannot be classified as either narrow or broad, but obviously vary widely in scope, from those affecting only a single administration of a single test, through those affecting performance on all items of a certain type, to those influencing the individual's performance in the large majority of his activities. From the standpoint of effective testing, however, a workable distinction can be made. Thus, it can be stated that a test score is invalidated only when a particular experience raises it without appreciably affecting the criterion behavior that the test is designed to predict.

COACHING. The effects of coaching on test scores have been widely investigated. Many of these studies were conducted by British psychologists, with special reference to the effects of practice and coaching on the tests formerly used in assigning 11-year-old children to different types of secondary schools (Yates et al., 1953-1954). As might be expected, the improvement depends on the ability and earlier educational experiences of the examinees, the nature of the tests, and the amount and type of coaching provided. Individuals with deficient educational backgrounds are more likely to benefit from special coaching than are those who have had superior educational opportunities and are already prepared to do well on the tests. It is obvious, too, that the closer the resemblance between test content and coaching material, the greater the improvement in test scores. On the other hand, the more closely instruction is restricted to specific test content, the less likely is improvement to extend to criterion performance.

In America, the College Entrance Examination Board has been concerned about the spread of ill-advised commercial coaching courses for college applicants. To clarify the issues, the College Board conducted several well-controlled experiments to determine the effects of coaching on its Scholastic Aptitude Test and surveyed the results of similar studies by other, independent investigators (Angoff, 1971b; College Entrance Examination Board, 1968; Pike & Evans, 1972). These studies covered a variety of coaching methods and included students in both public and private high schools; one investigation was conducted with black students in 15 urban and rural high schools in Tennessee. The conclusion from all these studies is that intensive drill on items similar to those on the SAT is unlikely to produce appreciably greater gains than occur when students are retested with the SAT after a year of regular high school instruction.

On the basis of such research, the Trustees of the College Board issued a formal statement about coaching, in which the following points were made, among others (College Entrance Examination Board, 1968, pp. 8-9):

The results of the coaching studies which have thus far been completed indicate that average increases of less than 10 points on a 600 point scale can be expected. It is not reasonable to believe that admissions decisions can be affected by such small changes in scores. This is especially true since the tests are merely supplementary to the school record and other evidence taken into account by admissions officers.

As the College Board uses the term, aptitude is not something fixed and impervious to influence by the way the child lives and is taught. Rather, this particular Scholastic Aptitude Test is a measure of abilities that seem to grow slowly and stubbornly, profoundly influenced by conditions at home and at school over the years, but not responding to hasty attempts to relive a young lifetime.

It should also be noted that in its test construction procedures, the College Board investigates the susceptibility of new item types to coaching (Angoff, 1971b; Pike & Evans, 1972). Item types on which performance can be appreciably raised by short-term drill or instruction of a narrowly limited nature are not included in the operational forms of the tests.

PRACTICE. The effects of sheer repetition, or practice, on test performance are similar to the effects of coaching, but usually less pronounced. A number of studies have been concerned with the effects of the identical repetition of intelligence tests over periods ranging from a few days to several years (see Quereshi, 1968). The studies have covered individual as well as group tests; both adults and children, and both normal and mentally retarded persons, have been employed. All agree in showing significant mean gains on retests. Nor is improvement necessarily limited to the initial repetitions. Whether gains persist or level off in successive administrations seems to depend on the difficulty of the test and the ability level of the subjects.

The implications of such findings are illustrated by the results obtained in annual retests of 3,500 schoolchildren with a variety of intelligence tests (Dearborn & Rothney, 1941). When the same test was readministered in successive years, the median IQ of the group rose from 102 to 113, but it dropped to 104 when another test was substituted. Because of the retest gains, the meaning of an IQ obtained on an initial and a later trial proved to be quite different. For example, an IQ of 100 fell approximately at the average of the distribution on the initial trial, but in the lowest quarter on a retest. Such IQ's, though numerically identical and derived from the same test, might thus signify normal ability in the one instance and inferior ability in the other.

It should be noted that practice, as well as coaching, may alter the nature of the test, since the subjects may employ different work methods in solving the same problems. Moreover, certain types of items may be much easier when encountered a second time. An example is provided by problems requiring insightful solutions which, once attained, can be applied directly in solving the same or similar problems in a retest. Scores on such tests, whether derived from a repetition of the identical test or from a parallel form, should therefore be carefully scrutinized.

Gains in score are also found on retesting with parallel forms of the same test, although such gains tend in general to be smaller. Significant mean gains have been reported when alternate forms of a test were administered in immediate succession or after intervals ranging from one day to three years (Angoff, 1971b; Droege, 1966; Peel, 1951, 1952).

Similar results have been obtained with normal and intellectually gifted schoolchildren, high school and college students, and employee samples. Because of the retest gains, a statement of the distribution of gains to be expected on a retest with a parallel form should be provided in test manuals, and allowance for such gains should be made when interpreting test scores.

TEST SOPHISTICATION. The general problem of test sophistication should also be considered in this connection. The individual who has had extensive prior experience in taking psychological tests enjoys a certain advantage in test performance over one who is taking his first test (Heim & Wallace, 1949-1950; Millman, Bishop, & Ebel, 1965; Rodger, 1936). Part of this advantage stems from having overcome an initial feeling of strangeness, as well as from having developed more self-confidence and better test-taking attitudes. Part is the result of a certain amount of overlap in the type of content and functions covered by many tests. Specific familiarity with common item types and practice in the use of objective answer sheets may also improve performance slightly. It is particularly important to take test sophistication into account when comparing the scores obtained by children from different types of schools, where the extent of test-taking experience may have varied widely. Short orientation and practice sessions, as described earlier in this chapter, can be quite effective in equalizing test sophistication (Wahlstrom & Boersma, 1968).

CHAPTER 3

Social and Ethical Implications of Testing

IN ORDER to prevent the misuse of psychological tests, it has become necessary to erect a number of safeguards around both the tests themselves and the test scores. The distribution and use of psychological tests constitutes a major area in Ethical Standards of Psychologists, the code of professional ethics officially adopted by the American Psychological Association and reproduced in Appendix A. Principles 13, 14, and 15 are specifically directed to testing, being concerned with Test Security, Test Interpretation, and Test Publication. Other principles that, although broader in scope, are highly relevant to testing include 6 (Confidentiality), 7 (Client Welfare), and 9 (Impersonal Services). Some of the matters discussed in the Ethical Standards are closely related to points covered in the Standards for Educational and Psychological Tests (1974), cited in Chapter 1. For a fuller and richer understanding of the principles set forth in the Ethical Standards, the reader should consult two companion publications: the Casebook on Ethical Standards of Psychologists (1967) and Ethical Principles in the Conduct of Research with Human Participants (1973). Both report specific incidents to illustrate each principle. Special attention is given to marginal situations in which there may be a conflict of values, as between the advancement of science for human betterment and the protection of the rights and welfare of individuals.

The requirement that tests be used only by appropriately qualified examiners is one step toward protecting the individual against the improper use of tests. Of course, the necessary qualifications vary with the type of test. Thus, a relatively long period of intensive training and supervised experience is required for the proper use of individual intelligence tests and most personality tests, whereas a minimum of specialized psychological training is needed in the case of educational achievement or vocational proficiency tests.

A useful distinction is that between a psychologist working in an institutional setting, such as a school system, university, clinic, or government agency, and one engaged in independent practice. Because the independent practitioner is less subject to judgment and evaluation by knowledgeable colleagues than is the institutional psychologist, he needs to meet higher standards of professional qualification. For the independent practice of psychology, the requirements are generally a PhD in psychology, a specified amount of supervised experience, and satisfactory performance on a qualifying examination. Usually individuals with a master's degree in psychology or its equivalent qualify for institutional positions. The same would be true of a psychologist responsible for the supervision of other institutional psychologists or one who serves as an expert consultant to institutional personnel. When tests are administered by psychological technicians or assistants, it is essential that an adequately qualified psychologist be available, at least as a consultant. It should also be noted that students who take tests in class for instructional purposes are not usually equipped to administer the tests to others or to interpret the scores properly.

Who is a qualified psychologist? Obviously, with the diversification of the field and the consequent specialization of training, no psychologist is equally qualified in all areas. In recognition of this fact, the Ethical Standards specify: "The psychologist recognizes the boundaries of his competence and the limitations of his techniques and does not offer services or use techniques that fail to meet professional standards established in particular fields" (Appendix A, Principle 2c).

A significant step, both in upgrading professional standards and in helping the public to identify qualified psychologists, was the enactment of state licensing and certification laws for psychologists. Nearly all states now have such laws. Although the terms "licensing" and "certification" are often used interchangeably, in psychology certification typically refers to legal protection of the title "psychologist," whereas licensing controls the practice of psychology. Licensing laws thus need to include a definition of the practice of psychology. Although most states began with the simpler certification laws, there has been continuing movement toward licensing. In either type of law, violations of the APA ethics code constitute grounds for revoking a certificate or license.

At a more advanced level, specialty certification within psychology is provided by the American Board of Professional Psychology (ABPP). Requiring a high level of training and experience within designated specialties, ABPP grants diplomas in such areas as clinical, counseling, industrial and organizational, and school psychology. As a privately constituted board within the profession, ABPP does not have the enforcement authority available to the agencies administering the state licensing and certification laws. The principal function of ABPP is to provide information regarding qualified psychologists; the Biographical Directory of the APA contains a list of current diplomates in each specialty, which can also be obtained directly from ABPP.

The well-trained examiner chooses tests that are appropriate for both the particular purpose for which he is testing and the person to be examined. He is also cognizant of the available research literature on the chosen test and able to evaluate its technical merits with regard to such characteristics as norms, reliability, and validity. In administering the test, he is sensitive to the many conditions that may affect test performance, such as those illustrated in Chapter 2. He draws conclusions or makes recommendations only after considering the test score (or scores) in the light of other pertinent information about the individual. Above all, he should be sufficiently knowledgeable about the science of human behavior to guard against unwarranted inferences in his interpretations of test scores.

Misconceptions about the nature and purpose of tests and misinterpretations of test results underlie many of the popular criticisms of psychological tests. In part, these difficulties arise from inadequate communication between psychometricians and their various publics: educators, parents, legislators, job applicants, and so forth. Probably the most common examples center on unfounded inferences from IQs. Not all misconceptions about tests, however, can be attributed to inadequate communication between psychologists and laymen. The growing complexity of the science of psychology has inevitably been accompanied by increasing specialization among psychologists. In this process, psychological testing itself has tended to become dissociated from the mainstream of behavioral science (Anastasi, 1967). Psychometricians have concentrated more and more on the technical refinements of test construction and have tended to lose contact with developments in other relevant specialties, such as learning, child development, individual differences, and behavior genetics. Thus, the technical aspects of test construction have tended to outstrip the psychological sophistication with which test results are interpreted. Test scores can be properly interpreted only in the light of all available knowledge regarding the behavior that the tests are designed to measure.

The purchase of tests is generally restricted to persons who meet certain minimal qualifications, and the catalogues of major test publishers specify the requirements that must be met by purchasers.

Distinctions are made among levels of tests, ranging from educational achievement and vocational proficiency tests, through group intelligence tests and interest inventories, to such clinical instruments as individual intelligence tests and personality tests; some publishers classify their tests into levels with reference to user qualifications. Purchases are made by appropriately qualified individuals and by authorized institutions. Graduate students who need a particular test for a class assignment or for research must have the order countersigned by their psychology instructor, who assumes responsibility for the proper use of the test.

The marketing of psychological tests also carries professional responsibilities for test authors and publishers. Tests should not be released prematurely for general use, nor should any claims be made regarding the merits of a test in the absence of sufficient objective evidence. When a test is distributed early for research purposes only, this condition should be clearly specified and the distribution of the test restricted accordingly. The test manual should provide adequate data to permit an evaluation of the test itself, as well as full information regarding administration, scoring, and norms; it should be a factual exposition of what is known about the test rather than a selling device designed to put the test in a favorable light. It is the responsibility of test authors and publishers to revise tests and norms often enough to prevent obsolescence. The rapidity with which a test becomes outdated will, of course, vary widely with the nature of the test. The Ethical Standards further state that test scores, like test materials, "are released only to persons who are qualified to interpret and use them properly" (Principle 14). Although test distributors make sincere efforts to implement these objectives, the major responsibility for the proper use of tests resides in the individual user or institution.

Another unprofessional practice is testing by mail. An individual's performance on either aptitude or personality tests cannot be properly assessed by mailing test forms to him and having him return them by mail for scoring and interpretation. Not only does this procedure provide no control of testing conditions, but usually it also involves the interpretation of test scores in the absence of other pertinent information about the individual. Under these conditions, test results may be worse than useless.

Nor should tests be published in a newspaper, a magazine, or a popular book, either for descriptive purposes or for self-evaluation. Under these conditions, self-evaluation would not only be subject to such drastic errors as to be well-nigh worthless, but it might also be psychologically injurious to the individual. Moreover, any publicity given to specific test items will tend to invalidate the future use of the test with other persons. It might also be added that presentation of test materials in this fashion tends to create an erroneous and distorted picture of psychological testing; such publicity may foster either naive credulity or indiscriminate resistance on the part of the public toward all psychological testing.

A question arising particularly in connection with personality tests is that of invasion of privacy. Insofar as some tests of emotional, motivational, or attitudinal traits are necessarily disguised, the subject may reveal characteristics in the course of such a test without realizing that he is so doing. Although there are few available tests whose approach is subtle enough to fall into this category, the possibility of developing such indirect testing procedures imposes a grave responsibility on the psychologist who uses them. For testing to be effective, it may be necessary to keep the examinee in ignorance of the specific ways in which his responses on any one test are to be interpreted. Nevertheless, a person should not be subjected to any testing program under false pretenses.

Of primary importance in this connection is the obligation to have a clear understanding with the examinee regarding the use that will be made of his test results. The following statement contained in Ethical Standards of Psychologists (Principle 7d) is especially germane to this problem:

The psychologist who asks that an individual reveal personal information in the course of interviewing, testing, or evaluation, or who allows such information to be divulged to him, does so only after making certain that the responsible person is fully aware of the purposes of the interview, testing, or evaluation and of the ways in which the information may be used.

It should be noted, too, that any intelligence, aptitude, or achievement test may reveal limitations in skills and knowledge that an individual would rather not disclose. Moreover, any observation of an individual's behavior, as in an interview, casual conversation, or other personal encounter, may yield information about him that he would prefer to conceal and that he may reveal unwittingly.

The fact that psychological tests have often been singled out in discussions of the invasion of privacy probably reflects prevalent misconceptions about tests. If all tests were recognized as measures of behavior samples, with no mysterious powers to penetrate beyond behavior, popular fears and suspicion would be lessened.

It should also be noted that all behavior research, whether or not it utilizes tests, presents the possibility of invasion of privacy. Yet, as scientists, psychologists are committed to the goal of advancing knowledge about human behavior. Principle 1a in Ethical Standards of Psychologists (Appendix A) clearly spells out the psychologist's conviction that society will be best served when he investigates where his judgment indicates investigation is needed. Conflicts of values may thus arise, which must be resolved in individual cases; freedom of inquiry, which is essential to the progress of science, must be balanced against the protection of the individual. The problem is obviously not simple, and it has been the subject of extensive deliberation by psychologists and other professionals. In a report entitled Privacy and Behavioral Research (1967), prepared for the Office of Science and Technology, the right to privacy is defined as "the right of the individual to decide for himself how much he will share with others his thoughts, his feelings, and the facts of his personal life" (p. 2). It is further characterized as a right "essential to insure dignity and freedom of self-determination" (p. 2). To safeguard personal privacy, no universal rules can be formulated; only general guidelines can be provided. In the application of these guidelines to specific cases, there is no substitute for the ethical awareness and professional responsibility of the individual psychologist; solutions must be worked out in terms of particular circumstances. The investigator must be alert to the values involved and must carefully weigh alternative solutions. Examples of such conflict resolutions can be found in the previously cited Ethical Principles in the Conduct of Research with Human Participants (1973), and several principles in the Ethical Standards (e.g., 7d, 8a, 16) are concerned with the protection of privacy and the welfare of research subjects.

Whatever the purposes of testing, the protection of privacy involves two key concepts: relevance and consent. The information that the individual is asked to reveal must be relevant to the stated purposes of the testing. An important implication of this principle is that all practicable efforts should be made to ascertain the validity of tests for the particular diagnostic or predictive purpose for which they are used; an instrument that is demonstrably valid for a given purpose is one that provides relevant information. An individual is also less likely to feel that his privacy is being invaded by a test assessing his readiness for a particular educational program than by a test allegedly measuring his "innate intelligence."

The concept of informed consent likewise requires clarification, and its application in individual cases may call for the exercise of considerable judgment (Privacy and Behavioral Research, 1967; Ruebhausen & Brim, 1966). The examinee should certainly be informed about the purpose of testing, the kinds of data sought, and the use that will be made of his scores. It is not implied, however, that he be shown the test items in advance or told how specific responses will be scored. Nor, in the case of a minor, should the test items be shown to a parent. Not only would the giving of this information seriously impair the usefulness of an ability test, but it would also tend to distort responses on many personality tests; such information would usually invalidate the test. For example, if an individual is told in advance that a self-report inventory will be scored with a dominance scale, his responses are likely to be influenced by stereotyped (and often erroneous) ideas he may have about this trait, or by a false or distorted self-concept.

One relevant factor is the purpose for which the testing is conducted, whether for individual counseling, institutional decisions regarding selection and classification, or research. In clinical or counseling situations, the client is usually willing to reveal himself in order to obtain help with his problems. The clinician or examiner does not invade privacy where he is freely admitted. Even under these conditions, however, the client should be warned that in the course of the testing or interviewing he may reveal information about himself without realizing that he is so doing, or he may disclose feelings of which he himself is unaware.

When testing is conducted for institutional purposes, the examinee should be fully informed as to the use that will be made of his test scores. It is also desirable, however, to explain to the examinee that correct assessment will benefit him, since it is not to his advantage to be placed in a position where he will fail or which he will find uncongenial. The results of tests administered in a clinical or counseling situation, of course, should not be made available for institutional purposes unless the examinee gives his consent.

When tests are given for research purposes, anonymity should be preserved as fully as possible, and the procedures for ensuring such anonymity should be explained in advance to the subjects. Experimental designs and procedures should protect the individual's right to decline to participate and should adequately safeguard his privacy. Anonymity does not, however, solve the problem of protecting privacy in all research contexts. Some subjects may resent the disclosure of facts they consider personal, even when complete confidentiality of responses is assured. In most cases, however, the cooperation of subjects may be elicited if they are convinced that the information is needed for the research in question and if they have sufficient confidence in the integrity and competence of the investigator.

With proper rapport and attitudes of mutual respect, the number of refusals to respond to personal questions may be reduced to a negligible quantity. There is also some evidence that the number of items perceived as offensive is significantly reduced when respondents understand why the information is sought, and, from the standpoint of test validity, that such an explanation of purpose did not affect the mean profile of scores on a personality inventory (Fink & Butcher, 1972).

In the testing of children, special questions arise with regard to consent. An excellent analysis of these questions, as they pertain to school testing, is given in the Guidelines for the Collection, Maintenance and Dissemination of Pupil Records (Russell Sage Foundation, 1970), prepared by a conference on the ethical and legal aspects of school record keeping. The Guidelines differentiate between individual consent and representational consent, the latter given by the parents' legally elected or appointed representatives, such as a school board. Representational consent is deemed sufficient for the administration of aptitude and achievement tests, for example, whereas a personality inventory would call for individual consent, preferably written. A helpful feature of the Guidelines is the inclusion of sample consent forms.

CONFIDENTIALITY. Like the protection of privacy, the problem of the confidentiality of test data is multifaceted. The fundamental question is: Who shall have access to test results? Several considerations influence the answer in particular situations. Among them are the security of test content, the hazards of misunderstanding test scores, and the need of various persons to know the results.

Discussions of the confidentiality of test records have usually dealt with accessibility to a third person, other than the individual tested (or a parent of a minor) and the examiner (Ethical Standards, Principle 14; Womer, 1970). The underlying principle is that such records should not be released without the knowledge and consent of the individual, and proper safeguards must be observed against misuse and misinterpretation of test findings (see Ethical Standards, Principle 6). There has also been a growing awareness of the right of the individual himself to have access to the findings in his test report. He should, moreover, have the opportunity to comment on the contents of the report and, if necessary, to clarify or correct factual information. Counselors are now trying more and more to involve the client as an active participant in his own assessment. For these purposes, test results should be presented in a form that is readily understandable, free from technical jargon or labels, and oriented toward the immediate objective of the testing.

In the case of minors, one must also consider the parents' right of access to the child's test record. This presents a possible conflict with the child's own right to privacy. In a searching analysis of the problem, Ruebhausen and Brim (1966, pp. 431-432) wrote: "Should not a child, even before the age of full legal responsibility, be accorded the dignity of a private personality? Considerations of healthy personal growth, buttressed with reasons of ethics, seem to command that this be done." The previously cited Guidelines (Russell Sage Foundation, 1970, p. 27) recommend that "when a student reaches the age of eighteen and no longer is attending high school, or is married (whether age eighteen or not)," he should have the right to deny parental access to his records. This recommendation is followed, however, by the caution that school authorities check local state laws for possible legal difficulties in implementing such a policy.

Parents normally have a legal right to information about their child, and it is usually desirable for them to have such information. In some cases, moreover, a child's academic or emotional difficulties may arise in part from parent-child relations. Under these conditions, the counselor's contact with the parents is of prime importance, both to fill in background data and to elicit parental cooperation. Often the question is not whether to communicate test results to the parents of a minor, but how to do so.

When tests are administered in an institutional context, as in a school system, court, or employment setting, the individual should be informed at the time of testing regarding the purpose of the test, how the results will be used, and their availability to institutional personnel who have a legitimate need for them.

Another problem pertains to the retention of records in institutions. On the one hand, longitudinal records can be very valuable, not only for research purposes but also in the interest of the individual, as in education and counseling. On the other hand, when records are retained for many years, there is danger that they may be used for purposes that the individual (or his parents) never suspected and would not have approved, and that incorrect inferences may be drawn from obsolete data. A reading achievement score obtained by a child in the third grade, for example, would be seriously misleading if cited in evaluating him for admission to college: too much may have happened to him in the intervening years. When records are retained, either in the interest of the individual or for acceptable research purposes, access to them should be subject to unusually stringent controls.

The previously cited Guidelines (Russell Sage Foundation, 1970) again provide useful models. School records are classified into three categories with regard to retention; a major determining factor in this classification is the degree of objectivity and verifiability of the data, and another is their relevance to the educational objectives of the school. The Guidelines also contain a sample form for the use of school systems in regulating the transmission of school records and other personal data (p. 42), together with explicit policies regarding the destruction of obsolete records. It would be desirable for every institution to formulate similarly explicit policies regarding the retention and accessibility of test results.

The problems of maintaining the security and confidentiality of test results have been intensified by the development of computerized data banks. As Ruebhausen wrote (pp. 5-6):

Modern science has given us the capacity to record faithfully, to maintain permanently, to retrieve promptly, and to communicate both widely and instantly. ... Modern science has introduced a new dimension into the issues of privacy. There was a time when among the strongest allies of privacy were the inefficiency of man, the fallibility of his memory, and the healing compassion that accompanied both the passing of time and the warmth of human recollection.

The unprecedented advances in storing, processing, and retrieving data made possible by computers can be of inestimable service both in research and in the more immediate handling of social problems. The potential dangers of invasion of privacy and violation of confidentiality need to be faced squarely. Rather than fearing the centralization and efficiency of complex computer systems, however, we should explore the possibility that these very characteristics may permit more effective procedures for protecting the security of individual records.

An example of what can be accomplished with adequate facilities is provided by the Link system developed by the American Council on Education (Astin & Boruch, 1970). In a longitudinal research program on the effects of different types of college environments, questionnaires were administered annually to several hundred thousand college freshmen. To permit the collection of follow-up data on the same persons, while preventing the identification of individual responses by anyone at any future time, a three-file system of computer tapes was devised. The first tape, containing each student's responses marked with an arbitrary identification number, is readily accessible for research purposes. The second tape, containing only the students' names and addresses with the same identification numbers, was originally housed in a locked vault and used only to print labels for follow-up mailings. This two-file system represents the traditional security arrangement. It still did not provide complete protection, since some staff members would have access to both files; moreover, such files are subject to judicial and legislative subpoena. For these reasons, a third file was prepared, known as the Link file: it contained only the original identification numbers and a new set of random numbers, which were substituted for the original identification numbers in the name and address file. After the preparation of these tapes, the original questionnaires were destroyed. The Link file was deposited at a computer facility in a foreign country, with the agreement that the file would never be released to anyone. Under these conditions, no one can identify the responses of individuals in the data files. Follow-up data tapes are sent to the foreign facility, which substitutes one set of code numbers for the other. The procedure could be simplified somewhat if the linking facility were located in a domestic agency given adequate protection against subpoena. Such elaborate precautions for the protection of confidentiality obviously would not be feasible except in a large-scale computerized data bank.
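To make the mechanics of the three-file arrangement concrete, the following sketch shows one way the files and the code-number substitution could be represented. It is a minimal illustration only: the record layouts, field names, and use of in-memory Python structures are assumptions made for the example, not the actual tape formats or procedures of the American Council on Education.

```python
# Illustrative sketch of a Link-style three-file anonymization scheme.
# All layouts and names are hypothetical; ID collisions are ignored.
import secrets

def build_files(questionnaires):
    """Split raw questionnaires into data, address, and link files."""
    data_file, address_file, link_file = [], [], {}
    for arbitrary_id, q in enumerate(questionnaires):
        random_id = secrets.randbelow(10**9)  # new random code number
        # File 1: responses only, keyed by the arbitrary ID (research use).
        data_file.append({"id": arbitrary_id, "responses": q["responses"]})
        # File 2: names and addresses, keyed by the substituted random ID.
        address_file.append({"id": random_id, "name": q["name"],
                             "address": q["address"]})
        # File 3 (the "Link" file): the only bridge between the two IDs,
        # to be held by an outside custodian. Here it is simply a dict.
        link_file[random_id] = arbitrary_id
    return data_file, address_file, link_file

def relabel_followup(followup_records, link_file):
    """Custodian step: swap the random IDs on returned follow-up data
    for the arbitrary IDs, so researchers can merge follow-ups with
    File 1 without ever seeing names or addresses."""
    return [{"id": link_file[r["id"]], "responses": r["responses"]}
            for r in followup_records]
```

With the original questionnaires destroyed and the link file held by an outside custodian, neither the data file nor the address file alone suffices to connect a name with a set of responses, which is the essential property of the scheme described above.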

Psychologists have given much thought to the communication of test results in a form that will be meaningful and useful. It is clear that the information should not be transmitted routinely, but should be accompanied by interpretive explanations from a professionally trained person. Broad levels of performance and qualitative descriptions in simple terms are preferred over specific numerical scores, except when communicating with adequately trained professionals. Even well-educated laymen have been known to confuse percentiles with percentage scores, percentile norms with standards, and interest ratings with aptitude scores. A more serious misinterpretation pertains to the conclusions drawn from a test score, even when its technical meaning is understood. A familiar example is the popular assumption that an IQ indicates a fixed characteristic of the individual which predetermines his lifetime level of intellectual achievement. This is especially true of intelligence tests, which are more likely to be misinterpreted than are achievement tests.
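The difference between a percentile and a percentage score, one of the confusions just noted, is easy to state concretely: a percentage score reports the proportion of items answered correctly, while a percentile rank reports the examinee's standing relative to a norm group. The short sketch below illustrates the distinction; the sample scores and the simple counting definition of percentile rank are assumptions made for the example, not data from any actual test.

```python
def percentage_score(raw_score, n_items):
    """Proportion of items correct: content-referenced."""
    return 100.0 * raw_score / n_items

def percentile_rank(raw_score, norm_scores):
    """Percent of the norm group scoring below raw_score: norm-referenced."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)

# Hypothetical 80-item test; norm group of ten examinees.
norms = [35, 38, 40, 42, 45, 47, 50, 55, 60, 64]
print(percentage_score(60, 80))   # 75.0 -> answered 75% of items correctly
print(percentile_rank(60, norms)) # 80.0 -> outscored 80% of the norm group
```

The same raw score thus yields two quite different numbers, which is precisely why the two terms are so easily conflated by untrained recipients of test reports.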
In all such communication, it is desirable to take into account the characteristics of the person who is to receive the information. This applies not only to that person's general education and knowledge about psychology and testing, but also to his anticipated emotional response to the information. In the case of a parent or teacher, for example, personal involvement with the child may interfere with a calm and rational acceptance of factual information.

In communicating results to teachers, school administrators, employers, and other appropriate persons, similar safeguards against misinterpretation should be provided as in communicating with the individual or his parents. In communicating scores to parents, a recommended procedure is to arrange a group meeting at which a counselor or school psychologist explains the purpose and nature of the tests, the sort of conclusions that may reasonably be drawn from the results, and the limitations of the data. Written reports about their own children may then be distributed to the parents, and arrangements made for personal interviews with any parents wishing to discuss the reports further.

Last, but by no means least, is the problem of communicating test results to the individual himself, whether child or adult. This is especially true when he is learning about his own assets and shortcomings: the person's emotional reaction to the information is vitally important. When an individual is given his own test results, not only should the data be interpreted by a properly qualified person, but facilities should also be available for counseling anyone who may become emotionally disturbed by such information. For example, a college student might become seriously discouraged when he learns of his poor performance on a scholastic aptitude test. A gifted schoolchild might develop habits of laziness and shiftlessness, or he might become uncooperative and unmanageable, if he discovers that he is much brighter than any of his associates. A severe personality disorder may be precipitated when a maladjusted individual is given his score on a personality test. Such detrimental effects may, of course, occur regardless of the correctness or incorrectness of the score itself. Even when a test has been accurately administered and scored and properly interpreted, a knowledge of such a score without the opportunity to discuss it further may be harmful to the individual.

Counseling psychologists have been especially concerned with the development of effective ways of transmitting test information to their clients (see, e.g., Goldman, 1971). Although the details of this process are beyond the scope of the present discussion, test-reporting is to be viewed as an integral part of the counseling process and incorporated into the total counselor-client relationship. Insofar as possible, test results should be reported as answers to specific questions raised by the counselee. An important consideration in counseling relates to the counselee's acceptance of the information presented to him; the counseling situation is such that, if the individual rejects any information, for whatever reasons, then that information is likely to be totally wasted.

The decades since 1950 have witnessed an increasing public concern with the rights of minorities, a concern that is reflected in the enactment of civil rights legislation at both federal and state levels. In connection with mechanisms for improving the educational and vocational opportunities of such groups, psychological testing has been a major focus of attention. The psychological literature of the 1960s and early 1970s contains many discussions of the topic, whose impact ranges from clarification to obfuscation. Among the more clarifying contributions are several position papers by professional associations (see, e.g., American Psychological Association, 1969; Cleary, Humphreys, Kendrick, & Wesman, 1975; Deutsch, Fishman, Kogan, North, & Whiteman, 1964). Although women represent a statistical majority in the national population, they have shared many of the problems of minorities. Hence, when the term "minority" is used in this section, it will be understood to include women.

In testing culturally diverse persons, it is important to differentiate between cultural factors that affect both test and criterion behavior and those whose influence is restricted to the test. Every psychological test measures a behavior sample; insofar as culture affects behavior, its influence will and should be detected by tests. If we rule out cultural differentials from a test, we may thereby lower its validity as a measure of the criterion behavior it was designed to assess. In that case, the test would fail to provide the kind of information needed to correct the very conditions that impaired performance.

Because the testing of minorities represents a special case of cross-cultural testing, the theoretical rationale and testing procedures discussed more fully in Chapter 12 are relevant here, and a technical analysis of the concept of test bias, in connection with test validity, is given in Chapter 7. In the present chapter, our interest is chiefly in the basic issues and social implications of minority group testing. A brief but cogent paper by Flaugher (1974) also helps to clear away some prevalent misunderstandings about the nature and function of psychological tests in the testing of culturally diverse groups.

TEST-RELATED FACTORS. Special consideration should be given to test-related factors, that is, conditions affecting performance on the particular test but unrelated to criterion performance. Examples of such factors include previous experience in taking tests, motivation to perform well on tests, rapport with the examiner, and general misunderstandings about the nature and function of psychological tests. For example, the use of names or pictures of objects unfamiliar in a particular cultural milieu would obviously represent a test-restricted handicap in a quantitative test, since the ability to carry out quantitative thinking does not depend upon familiarity with such objects. Test-taking orientation and preliminary practice are provided by the booklets and tape recordings cited in Chapter 2; such procedures, as well as retesting with a parallel form, are especially recommended with low-scoring examinees who have had little or no prior testing experience.

A related concern is the cultural appropriateness of test content. The major test publishers now make special efforts to weed out inappropriate test content. Their own test construction staffs have become sensitized to potentially offensive, culturally restricted, or stereotyped material; members of different ethnic groups participate either as regular staff members or as consultants; and the reviewing of test content with reference to possible minority implications is a regular step in the process of test construction. An example of the application of these procedures in item construction and revision is provided by the 1970 edition of the Metropolitan Achievement Tests (Fitzgibbon, 1972). As one test publisher aptly expressed it, "Until fairly recently, most standardized tests were constructed by white middle-class people. ... In a way, one could say that we have been not so much culture biased as we have been 'culture blind'" (Fitzgibbon, 1972, pp. 2-3).

Certain words may have acquired connotations that are offensive to minority groups. Stories or pictures portraying typical suburban middle-class family scenes may alienate a child reared in a low-income inner-city home. Exclusive representation of the physical features of a single racial type in test illustrations may have a similar effect on members of an ethnic minority. In the same vein, women's organizations have objected to the perpetuation of sex stereotypes in test content, as in the portrayal of male doctors or executives and female nurses or secretaries (Harcourt Brace Jovanovich, 1972). Another, more subtle way in which specific test content may spuriously affect performance is through the examinee's emotional and attitudinal responses.

INTERPRETATION AND USE OF TEST SCORES. By far the most important considerations in the testing of culturally diverse groups, as in all testing, pertain to the interpretation of test scores. The most frequent misgivings regarding the use of tests with minority group members stem from misinterpretations of scores. If a minority examinee obtains a low score on an aptitude test or a deviant score on a personality test, it is essential to investigate why he did so. An inferior score on an arithmetic test, for example, could result from low test-taking motivation, poor reading ability, or inadequate knowledge of arithmetic. Moreover, if the development of arithmetic ability itself is more strongly fostered in one culture than in another, scores on an arithmetic test should not eliminate or conceal such a difference. Some thought should also be given to the type of norms to be employed in evaluating individual scores. Depending on the purpose of the testing, the appropriate norms may be general norms, or norms based on specific subgroups.
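The practical force of this choice of norms is easy to demonstrate. The sketch below evaluates one raw score against two hypothetical norm distributions, a general one and a subgroup one; the numbers are invented for illustration and do not come from any published test, and the percentile-rank function is redefined here only to keep the example self-contained.

```python
# Hypothetical illustration: the same raw score can occupy very
# different relative positions depending on the norm group used.
def percentile_rank(score, norm_group):
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)

general_norms  = [20, 25, 30, 34, 38, 42, 46, 50, 55, 60]  # broad sample
subgroup_norms = [10, 14, 18, 20, 24, 27, 30, 33, 36, 40]  # comparable backgrounds

raw = 34
print(percentile_rank(raw, general_norms))   # 30.0: below average generally
print(percentile_rank(raw, subgroup_norms))  # 80.0: high within the subgroup
```

Which of the two figures is the appropriate one to report depends, as noted above, on the purpose for which the testing is conducted.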

According to a popular misconception, the IQ is an index of innate intellectual potential and represents a fixed property of the organism. As will be seen in Chapter 12, this view is neither theoretically defensible nor supported by empirical data. Intelligence test scores should not foster a rigid categorizing of persons. On the contrary, intelligence tests (and any other test) may be regarded as a map on which the individual's present position can be located. When combined with information about his experiential background, test scores should facilitate effective planning for the optimal development of the individual. When low scores are instead read as evidence of permanent inferiority, culturally handicapped children may be denied the attention that could improve their performance; an IQ would thus serve to perpetuate their handicap.

It is largely because implications of permanent status have become attached to the IQ that in 1964 the use of group intelligence tests was discontinued in the New York City public schools (H. B. Gilbert, 1966; Loretan, 1965). It should also be noted that the use of individual intelligence tests like the Stanford-Binet, which are administered and interpreted by trained examiners and school psychologists, was not eliminated. It was the mass testing and routine use of IQs by relatively unsophisticated persons that was considered hazardous. That it proved necessary to discard the tests in order to eliminate the misconceptions about the fixity of the IQ is a revealing commentary on the tenacity of the misconceptions.

OBJECTIVITY OF TESTS. When properly interpreted, tests may in fact serve to reduce rather than increase unfair discrimination. The Guidelines for Testing Minority Group Children (Deutsch et al., 1964, p. 139) contain the following observation:

Many bright, non-conforming pupils, with backgrounds different from those of their teachers, make favorable showings on achievement tests, in contrast to their low classroom marks. These are very often children whose cultural handicaps are most evident in their overt social and interpersonal behavior. Without the intervention of standardized tests, many such children would be stigmatized by the adverse subjective ratings of teachers who tend to reward conformist behavior of middle-class character.

Commenting on the use of tests in schools, Gardner (1961, pp. 48-49) wrote: "The tests couldn't see whether the youngster was in rags or in tweeds, and they couldn't hear the accents of the slum. The tests revealed intellectual gifts at every level of the population."

With regard to personnel selection, when social stereotypes and prejudice may distort interpersonal evaluations, tests likewise provide a safeguard against favoritism and arbitrary or capricious decisions. The contribution of tests was aptly characterized in the following words by John Macy, Jr., Chairman of the United States Civil Service Commission, writing in a special journal issue on testing and public policy (1965, p. 883):

On the contrary, properly developed tests, used in conjunction with other tools of personnel assessment, can aid in the utilization and conservation of human resources and in the maintenance of an efficient work force, while preventing irrelevant and unfair discrimination; public confidence in the career services of the federal government has rested in large part on the perceived fairness and objectivity of the appraisal methods that applicants must submit to.

LEGAL REGULATIONS. A number of states have enacted fair employment legislation and established Fair Employment Practices Commissions (FEPC) to implement it. The most pertinent federal legislation is Title VII of the Civil Rights Act of 1964 (as amended by the Equal Employment Opportunity Act of 1972 and its subsequent amendments), which prohibits discrimination by employers, trade unions, or employment agencies on the basis of race, color, religion, sex, or national origin. Responsibility for implementation and enforcement is vested in the Equal Employment Opportunity Commission (EEOC). When charges are filed, the EEOC investigates the complaint and, if it finds the charges justified, tries first to correct the situation through conferences and voluntary compliance. If these procedures fail, EEOC may hold hearings, issue cease-and-desist orders, and finally bring action in the federal courts. In states having an approved FEPC, the Commission will defer to the local agency and will give its findings and conclusions "substantial weight."

The Office of Federal Contract Compliance (OFCC) has the authority to monitor the use of tests for employment purposes by government contractors. Both EEOC and OFCC have drawn up guidelines regarding employee testing and other selection procedures, which are virtually identical in substance. A copy of the EEOC Guidelines on Employee Selection Procedures is reproduced in Appendix B, together with a 1974 amendment of the OFCC guidelines clarifying acceptable procedures for reporting test validity. A brief summary of the major legislative actions, executive orders, and court decisions affecting employment testing can be found in Fincher (1973).
Colleges and universities are among the institutions concerned with OFCC regulations, because of their many research and training grants from such federal sources as the Department of Health, Education, and Welfare.

Some major provisions in the EEOC Guidelines should be noted.³ The Guidelines proceed from the belief that properly validated and standardized employee selection procedures can significantly contribute to the implementation of nondiscriminatory personnel policies, as required by Title VII; it is also recognized that professionally developed tests, when used in conjunction with other tools of personnel assessment, may aid in the development of programs of job design and in measuring characteristics of people that are related to job performance. Properly conducted testing programs are thus not only acceptable under the Act but can also contribute to the "implementation of nondiscriminatory personnel policies." Moreover, the same regulations specified for tests are also applied to all other formal and informal selection procedures, such as educational or work-history requirements, interviews, and application forms (Sections 2 and 13).

A major portion of the Guidelines covers minimum requirements for acceptable validation (Sections 5 to 9). In defining acceptable procedures for establishing validity, the Guidelines make explicit reference to the Standards for Educational and Psychological Tests (1974) prepared by the American Psychological Association. It will be seen that the requirements are generally in line with good psychometric practice; the reader may find it profitable to review them after reading the more detailed technical discussion of validity in Chapters 6 and 7 of this book. When the use of a test (or other selection procedure) results in a significantly higher rejection rate for minority candidates than for nonminority candidates, its utility must be justified by evidence of validity for the job in question. In the final section, dealing with affirmative action, the Guidelines point out that even when selection procedures have been satisfactorily validated, if disproportionate rejection rates result for minorities, steps should be taken to reduce this discrepancy as much as possible.

³ In 1973, in the interest of simplification and improved coordination, the preparation of a set of uniform guidelines was undertaken by the Equal Employment Opportunity Coordinating Council, consisting of representatives of the EEOC, the U.S. Civil Service Commission, the U.S. Commission on Civil Rights, the U.S. Department of Justice, and the U.S. Department of Labor. No uniform version has yet been adopted.

Affirmative action implies that an organization does more than merely avoid discriminatory practices. Psychologically, affirmative action programs may be regarded as efforts to compensate for the residual effects of past social discrimination. Such effects may include deficiencies in aptitudes, motivation, job skills, and other job-related behavior. They may also be evident in a person's reluctance to apply for a job not traditionally open to minority candidates, or in his inexperience in job-seeking procedures. Affirmative actions in meeting these problems include remedial and special training programs for the acquisition of prerequisite knowledge and skills, recruiting through media most likely to reach minorities and, when practicable, explicitly encouraging minority candidates to apply and following other recruiting practices designed to counteract past stereotypes.

PART 2

Principles of Psychological Testing

CHAPTER 4

Norms and the Interpretation of Test Scores

IN THE absence of additional interpretive data, a raw score on any psychological test is meaningless. To say that an individual has correctly solved 15 problems on an arithmetic reasoning test, or identified 34 words in a vocabulary test, or successfully assembled a mechanical object in 57 seconds conveys little or no information about his standing in any of these functions. Nor do the familiar percentage scores provide a satisfactory solution to the problem of interpreting test scores. A score of 65 percent correct on one vocabulary test, for example, might be equivalent to 30 percent correct on another, and to 80 percent correct on a third. The difficulty level of the items making up each test will, of course, determine the meaning of the score. Like all raw scores, percentage scores can be interpreted only in terms of a clearly defined and uniform frame of reference.

Scores on psychological tests are most commonly interpreted by reference to norms, which represent the test performance of the standardization sample. The norms are thus empirically established by determining what a representative group of persons actually do on the test. Any individual's raw score is then referred to the distribution of scores obtained by the standardization sample, to discover where he falls in that distribution. Does his score coincide with the average performance of the standardization group? Is he slightly below average? Or does he fall near the upper end of the distribution?

In order to determine more precisely the individual's exact position with reference to the standardization sample, the raw score is converted into some relative measure. These derived scores are designed to serve a dual purpose. First, they indicate the individual's relative standing in the normative sample and thus permit an evaluation of his performance in reference to other persons. Second, they provide comparable measures that permit a direct comparison of the individual's performance on different tests. For example, if an individual has a raw score of 40 on a vocabulary test and a raw score of 22 on an arithmetic reasoning test, we

know nothing about his relative performance on the two tests. Is he better in vocabulary than in arithmetic reasoning, or equally good in both? Since scores on different tests are usually expressed in different units, a direct comparison of such raw scores is impossible. The difficulty level of the two tests would also affect such a comparison. Derived scores, on the other hand, can be expressed in the same units and referred to the same or to closely similar normative samples for different tests. The individual's relative performance in many different functions can thus be compared.

There are various ways in which raw scores may be converted to fulfill the two objectives stated above. Fundamentally, however, derived scores are expressed in one of two major ways: (1) developmental level attained, or (2) relative position within a specified group. These types of scores, together with some of their common variants, will be considered in separate sections of this chapter. But first it will be necessary to examine certain elementary statistical concepts that underlie the development and utilization of norms. The following section is included simply to clarify the meaning of certain common statistical measures; simplified examples are given only for this purpose and not to provide training in statistical methods. For computational details and specific procedures to be followed in the practical application of these techniques, the reader is referred to any recent textbook on psychological or educational statistics.

STATISTICAL CONCEPTS

A major object of statistical method is to organize and summarize quantitative data in order to facilitate their understanding. A list of 1,000 test scores can be an overwhelming sight; in that form, it conveys little meaning. A first step in bringing order into such a chaos of raw data is to tabulate the scores into a frequency distribution, as illustrated in Table 1. Such a distribution is prepared by grouping the scores into convenient class intervals and tallying each score in the appropriate interval. When all the scores have been entered, the tallies are counted to find the frequency, or number of cases, in each class interval. The sum of these frequencies equals the total number of cases in the group. Table 1 shows the scores of 1,000 college students on a code-learning test in which one set of symbols, or nonsense syllables, was to be substituted for another, the score being the number of correct syllables substituted within a timed trial. The scores ranged from 8 to 52, and they have been grouped into class intervals of 4 points, from 8-11 at the bottom of the distribution to 52-55 at the top. The frequency column reveals that two persons scored between 8 and 11, three between 12 and 15, eight between 16 and 19, and so on.

TABLE 1
Frequency Distribution of Scores of 1,000 College Students on a Code-Learning Test
(Data from Anastasi, 1934, p. 34)

Class Interval     Frequency
    52-55                1
    48-51                1
    44-47               20
    40-43               73
    36-39              156
    32-35              328
    28-31              244
    24-27              136
    20-23               28
    16-19                8
    12-15                3
     8-11                2
    Total            1,000
The information provided by a frequency distribution can also be presented graphically in the form of a distribution curve. Figure 1 shows the data of Table 1 in graphic form. Two types of graphs are illustrated: the frequency polygon and the histogram. In both, the scores grouped into class intervals are shown on the baseline, or horizontal axis, and the frequencies, or number of cases falling within each class interval, on the vertical axis. In the histogram, the height of the column erected over each class interval corresponds to the number of persons scoring in that interval; we can think of each individual as standing on another's shoulders to form the column. In the frequency polygon, the number of persons in each interval is indicated by a point placed in the center of the class interval and across from the appropriate frequency. The successive points are then joined by straight lines.

Fig. 1. Distribution Curves: Frequency Polygon and Histogram. (Data from Table 1.)

Except for minor irregularities, the distribution portrayed in Figure 1 resembles the bell-shaped normal curve. A mathematically determined, perfect normal curve is reproduced in Figure 3. This type of curve has important mathematical properties and provides the basis for many kinds of statistical analyses. For the present purpose, only a few features will be noted. Essentially, the curve indicates that the largest number of cases cluster in the center of the range and that the number drops off gradually in both directions as the extremes are approached. The curve is bilaterally symmetrical, with a single peak in the center. Most distributions of human traits, from height and weight to aptitudes and personality characteristics, approximate the normal curve. In general, the larger the group, the more closely will the distribution resemble the theoretical normal curve.
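The tallying procedure just described is mechanical enough to express in a few lines of code. The sketch below is an illustration only (it is not part of the original discussion, and the sample scores are hypothetical); it groups scores into 4-point class intervals, as in Table 1, and counts the cases in each:

```python
# Minimal sketch: tabulate scores into a frequency distribution of
# fixed-width class intervals, as described in the text for Table 1.

def frequency_distribution(scores, low, width):
    """Group scores into class intervals of the given width, starting
    at `low`, and count the number of cases in each interval."""
    counts = {}
    for score in scores:
        start = low + ((score - low) // width) * width
        interval = (start, start + width - 1)          # e.g., (8, 11)
        counts[interval] = counts.get(interval, 0) + 1
    return counts

# Hypothetical illustrative scores; the 1,000 original scores are not listed.
sample_scores = [8, 14, 17, 22, 25, 31, 33, 34, 36, 41, 47, 52]
table = frequency_distribution(sample_scores, low=8, width=4)
for (lo, hi), freq in sorted(table.items(), reverse=True):
    print(f"{lo}-{hi}: {freq}")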

A group of scores can also be described in terms of some measure of central tendency. Such a measure provides a single, most typical or representative score to characterize the performance of the entire group. The most familiar of these measures is the average, more technically known as the mean (M). As is well known, this is found by adding all the scores and dividing the sum by the number of cases (N). Another measure of central tendency is the mode, or most frequent score. In a frequency distribution, the mode is the midpoint of the class interval with the highest frequency. Thus, in Table 1, the mode falls midway between 32 and 35, being 33.5; it will be noted that this score corresponds to the highest point on the distribution curve in Figure 1. A third measure of central tendency is the median, or middlemost score when all the scores have been arranged in order of size. The median is the point that bisects the distribution, half the cases falling above it and half below.

Further description of a set of test scores is given by measures of variability, or the extent of individual differences around the central tendency. The most obvious and familiar way of reporting variability is in terms of the range between the highest and the lowest score. The range, however, is extremely crude and unstable, for it is determined by only two scores; a single unusually high or low score would thus markedly affect its size. A more precise method of measuring variability is based on the differences between each individual's score and the mean of the group.

At this point it will be helpful to look at the example in Table 2, in which the various measures under consideration have been computed for a group of 10 cases. So small a group was chosen in order to simplify the demonstration, although in actual practice we would rarely perform these computations on so few cases. Table 2 also serves to introduce certain standard statistical symbols to be noted for future reference. Original raw scores are conventionally designated by a capital X, and a small x is used to refer to deviations of each score from the group mean. The Greek letter sigma (Σ) means "sum of." It will be seen that the first column in Table 2 gives the data for the computation of mean and median. The mean is 40 (M = ΣX/N = 400/10 = 40); the median is 40.5, falling midway between 40 and 41, with five cases on each side. Technically, 41 would represent the mode, because two persons obtained this score while all other scores occur only once; but there is little point in finding a mode in so small a group, since the cases do not show clear-cut clustering on any one score.

TABLE 2
Illustration of Central Tendency and Variability

Score (X)    Deviation (x = X - M)    |x|     x²
   48               +8                 8      64
   46               +6                 6      36
   44               +4                 4      16
   41               +1                 1       1
   41               +1                 1       1
   40                0                 0       0
   38               -2                 2       4
   37               -3                 3       9
   33               -7                 7      49
   32               -8                 8      64

ΣX = 400                     Σ|x| = 40    Σx² = 244
M = ΣX/N = 400/10 = 40       Median = 40.5
AD = Σ|x|/N = 40/10 = 4.0
Variance = σ² = Σx²/N = 244/10 = 24.4
SD = σ = √(Σx²/N) = √24.4 = 4.9

The second column of Table 2 shows how far each score deviates above or below the mean of 40. The sum of these deviations will always equal zero, because the positive and negative deviations around the mean necessarily balance, or cancel each other out (+20 - 20 = 0). If we ignore signs, however, we can average the absolute deviations, thus obtaining a measure known as the average deviation (AD). This has been done in Table 2; the symbol |x| in the AD formula indicates that absolute values were summed, without regard to sign. Although of some descriptive value, the AD is not suitable for use in further mathematical analyses, because of the arbitrary discarding of signs.

A much more serviceable measure of variability is the standard deviation (symbolized by either SD or σ), in which the negative signs are legitimately eliminated by squaring each deviation. This procedure is followed in the last column of Table 2. The sum of the squared deviations divided by the number of cases (Σx²/N) is known as the variance, symbolized by σ². The variance has proved extremely useful in sorting out the contributions of different factors to individual differences in test performance. For the present purposes, however, our chief concern is with the SD, which is the square root of the variance, as shown in Table 2. A distribution with wider individual differences yields a larger SD than one with narrower individual differences, as illustrated in Figure 2, which shows two distributions having the same mean but differing in variability. The SD is therefore commonly employed in comparing the variability of different groups; it also provides the basis for expressing an individual's scores on different tests in comparable terms, as will be shown in the section on standard scores.

Fig. 2. Frequency Distributions with the Same Mean but Different Variability.
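For readers who wish to verify the entries in Table 2, the following minimal sketch computes the same measures from the ten scores listed there. The code is illustrative and not part of the original discussion; the median formula, for simplicity, assumes an even number of cases:

```python
# Sketch: the measures of Table 2 computed directly from its ten scores.
# Expected results: M = 40, median = 40.5, AD = 4.0, variance = 24.4, SD ~ 4.9.
from math import sqrt

scores = [48, 46, 44, 41, 41, 40, 38, 37, 33, 32]
N = len(scores)

mean = sum(scores) / N                                  # M = sum(X) / N
ordered = sorted(scores)
median = (ordered[N // 2 - 1] + ordered[N // 2]) / 2    # midpoint (even N only)

deviations = [x - mean for x in scores]                 # x = X - M; sums to zero
ad = sum(abs(d) for d in deviations) / N                # average deviation
variance = sum(d * d for d in deviations) / N           # sum of x squared over N
sd = sqrt(variance)                                     # SD = square root of variance

print(mean, median, ad, variance, round(sd, 1))         # 40.0 40.5 4.0 24.4 4.9
```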
The interpretation of the SD is especially clear-cut when applied to a normal or approximately normal distribution curve. In such a distribution, there is an exact relationship between the SD and the proportion of cases, as shown in Figure 3. On the baseline of this normal curve have been marked distances representing one, two, and three standard deviations above and below the mean. The percentage of cases that fall between the mean and +1σ in a normal curve is 34.13. Because the curve is symmetrical, 34.13 percent of the cases are likewise found between the mean and -1σ, so that between +1σ and -1σ on both sides of the mean there are 68.26 percent of the cases. Similarly, 95.44 percent of the cases fall within ±2σ, and nearly all the cases (99.72 percent) fall within ±3σ from the mean. These relationships are particularly relevant in the interpretation of standard scores and percentiles, to be discussed in later sections. In the example given in Table 2, the mean corresponds to a score of 40, +1σ to 44.9 (40 + 4.9), +2σ to 49.8 (40 + 2 × 4.9), and so on.

Fig. 3. Percentage Distribution of Cases in a Normal Curve.
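The percentages just quoted can be checked directly from the normal cumulative distribution function. The sketch below is an added illustration; the small discrepancies in the second decimal (68.27 rather than 68.26, and so on) reflect the rounding used in traditional printed tables:

```python
# Sketch: proportions of a normal distribution within 1, 2, and 3 SDs.
from statistics import NormalDist

unit_normal = NormalDist()   # mean 0, SD 1

def pct_within(k):
    """Percentage of cases within plus or minus k SDs of the mean."""
    return 100 * (unit_normal.cdf(k) - unit_normal.cdf(-k))

for k in (1, 2, 3):
    print(f"within {k} SD: {pct_within(k):.2f}%")   # 68.27, 95.45, 99.73

# Applied to Table 2 (M = 40, SD = 4.9): +1 SD = 44.9, +2 SD = 49.8.
```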

DEVELOPMENTAL NORMS

One way in which meaning can be attached to test scores is to indicate how far along the normal developmental path the individual has progressed. Thus an 8-year-old who performs as well as the average 10-year-old on an intelligence test may be described as having a mental age of 10; a mentally retarded adult who performs at the same level would likewise be assigned an MA of 10. In a different context, a fourth-grade child may be characterized as reaching the sixth-grade norm on a reading test and the third-grade norm on an arithmetic test. Other developmental systems utilize more highly qualitative descriptions of behavior in specific functions, ranging from sensorimotor activities to concept formation. However expressed, scores based on developmental norms tend to be psychometrically crude and do not lend themselves well to precise statistical treatment. Nevertheless, they have considerable appeal for descriptive purposes, especially in the intensive clinical study of individuals and for certain research purposes.

MENTAL AGE. In Chapter 1 it was noted that the term "mental age" was widely popularized through the various translations and adaptations of the Binet-Simon scales, although Binet himself had employed the more neutral term "mental level." In age scales such as the Binet and its revisions, items are grouped into year levels. For example, those items passed by the majority of 7-year-olds in the standardization sample are placed in the 7-year level, those passed by the majority of 8-year-olds are assigned to the 8-year level, and so forth. A child's score on the test will then correspond to the highest year level that he can successfully complete. In actual practice, the individual's performance shows a certain amount of scatter; that is, the subject fails some tests below his mental age level and passes some above it. For this reason, it is customary to compute the basal age, i.e., the highest age at and below which all tests are passed. Partial credits, in months,
are then added to this basal age for all tests passed at higher levels. The child's mental age on the test is the sum of the basal age and the additional months of credit earned at higher age levels.

Mental age norms have also been employed with tests that are not divided into year levels. In such a case, the subject's raw score is first determined. Such a score may be the total number of correct items on the whole test, or it may be based on time, on number of errors, or on some combination of such measures. The mean raw scores obtained by the children in each year group within the standardization sample constitute the age norms for such a test. The mean raw score of the 8-year-old children, for example, would represent the 8-year norm. If an individual's raw score is equal to the mean 8-year-old raw score, then his mental age on the test is 8 years. All raw scores on such a test can be transformed in a similar manner by reference to the age norms. This relationship may be more readily visualized if we think of the individual's height as being expressed in terms of height age.

It should be noted that the mental age unit does not remain constant with age, but tends to shrink with advancing years. For example, a child who is one year retarded at age 4 will be approximately three years retarded at age 12: one year of mental growth from ages 3 to 4 is equivalent to three years of growth from ages 9 to 12. Since intellectual development progresses more rapidly at the earlier ages and gradually decreases as the individual approaches his mature limit, the mental age unit shrinks correspondingly with age. In terms of the height analogy, the difference in inches between a height age of 3 and 4 would be greater than that between a height age of 10 and 11. Owing to the progressive shrinkage of the MA unit, one year of acceleration or retardation at age 5 represents a larger deviation from the norm than does one year of acceleration or retardation at age 10.

GRADE EQUIVALENTS. Scores on educational achievement tests are often interpreted in terms of grade equivalents. This practice is understandable because the tests are employed within an academic setting. To describe a pupil's achievement as equivalent to seventh-grade performance in spelling, eighth-grade in reading, and fifth-grade in arithmetic has the same popular appeal as the use of mental age in the traditional intelligence tests. Grade norms are found by computing the mean raw score obtained by the children in each grade. Thus, if the average number of problems solved correctly on an arithmetic test by the fourth graders in the standardization sample is 23, then a raw score of 23 corresponds to a grade equivalent of 4. Intermediate grade equivalents, representing fractions of a grade, are usually found by interpolation, although they can also be obtained directly by testing children at different times within the school year. Because the school year covers ten months, successive months can be expressed as decimals: 4.0 refers to average performance at the beginning of the fourth grade (September testing), 4.5 refers to average performance at the middle of the grade (February testing), and so forth. A worked example of such interpolation is sketched at the end of this discussion.

Despite their popularity, grade norms have several shortcomings. First, the content of instruction varies somewhat from grade to grade; hence grade norms are appropriate only for common subjects taught throughout the grade levels covered by the test. They are not generally applicable at the high school level, where many subjects may be studied for only one or two years. Even with subjects taught in each grade, the emphasis placed on different subjects may vary from grade to grade, and progress may therefore be more rapid in one subject than in another during a particular grade. In other words, grade units are obviously unequal, and these inequalities occur irregularly in different subjects. Grade norms are also subject to misinterpretation unless the test user keeps firmly in mind the manner in which they were derived. For example, if a fourth-grade child obtains a grade equivalent of 6.9 in arithmetic, it does not mean that he has mastered the arithmetic processes taught in the sixth grade; he undoubtedly obtained his score largely by superior performance in fourth-grade arithmetic. It certainly could not be assumed that he has the prerequisites for seventh-grade arithmetic. Finally, grade norms tend to be incorrectly regarded as performance standards. A sixth-grade teacher, for example, may assume that all the pupils in her class should fall at or close to the sixth-grade norm in achievement tests. This misconception is certainly not surprising when grade norms are used. Yet individual differences within any one grade are such that the range of achievement test scores will inevitably extend over several grades.
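The interpolation mentioned above can be made concrete with a short sketch. The grade-norm values used here are hypothetical, chosen only so that a raw score of 23 reproduces the fourth-grade example in the text; the code itself is an added illustration, not part of the original discussion:

```python
# Sketch: intermediate grade equivalents by linear interpolation between
# the mean raw scores (grade norms) of successive grades.

grade_norms = {3.0: 17, 4.0: 23, 5.0: 30, 6.0: 38}   # hypothetical norms

def grade_equivalent(raw_score):
    """Interpolate a grade equivalent from the two bracketing grade norms."""
    points = sorted(grade_norms.items())
    for (g1, s1), (g2, s2) in zip(points, points[1:]):
        if s1 <= raw_score <= s2:
            return g1 + (g2 - g1) * (raw_score - s1) / (s2 - s1)
    raise ValueError("raw score outside the normed range")

print(round(grade_equivalent(23), 1))   # 4.0, the fourth-grade norm
print(round(grade_equivalent(26), 1))   # about 4.4, partway through grade 4
```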

ORDINAL SCALES. Another approach to developmental norms derives from empirical observation and research in child psychology. Gesell and his associates at Yale (Gesell et al., 1940; Gesell & Amatruda, 1947) prepared the Gesell Developmental Schedules, which show the approximate developmental level, in months, that the child has attained in each of four major areas of behavior: motor, adaptive, language, and personal-social. These levels are found by comparing the child's behavior with that typical of eight key ages, ranging from 4 weeks to 36 months.

Gesell and his co-workers emphasized the sequential patterning of early behavior development, citing extensive evidence of uniformities of developmental sequences and an orderly progression of behavior changes. For example, the infant's reactions toward a small object placed in front of him exhibit a characteristic chronological sequence in visual fixation and in hand and finger movements. Use of the entire hand in crude attempts at palmar prehension occurs at an earlier age than prehension employing the thumb in opposition to the palm; this is in turn followed by use of the thumb and index finger in a more efficient pincer grasp of the object. Such sequential patterning was likewise observed in locomotion, as in walking and stair climbing, and in most of the sensorimotor development of the first few years. The scales are thus ordinal in the sense that developmental stages follow in a constant order, each stage presupposing mastery of prerequisite behavior characteristic of earlier stages.¹

Since the 1960s, there has been a sharp upsurge of interest in the developmental theories of the Swiss child psychologist Jean Piaget (see Flavell, 1963; Ginsburg & Opper, 1969; Green, Ford, & Flamer, 1971). Piaget's research has focused on the development of cognitive processes from infancy to the midteens. He is concerned with specific concepts rather than broad abilities.

¹ This usage of the term "ordinal scale" differs from that in statistics, in which an ordinal scale is simply one that permits a rank-ordering of individuals without knowledge about the amount of difference between them. Ordinal scales of child development, in contrast, are similar to a Guttman scale, or simplex, in which successful performance at one level implies success at all lower levels (Guttman, 1944). An extension of Guttman's analysis to include nonlinear hierarchies is described by Bart and Airasian (1974).
In accordance with Piaget's approach, ordinal scales are designed to identify the stage reached by the child in the development of specific behavior functions. The ordinality of such scales refers to the uniform progression of development through successive stages, the attainment of any one stage being contingent upon completion of the earlier stages in the development of the concept. The tasks are designed to reveal the dominant aspects of each developmental stage; only later are empirical data gathered regarding the ages at which each stage is typically reached. In this respect, the procedure differs from that followed in constructing age scales, in which items are selected in the first place on the basis of their differentiating between successive ages.

An example of a concept, or schema, investigated by Piaget is object permanence, whereby the child is aware of the identity and continuing existence of objects when they are seen from different angles or are out of sight. Another widely studied concept is conservation, or the recognition that an attribute remains constant over changes in perceptual appearance, as when the same quantity of liquid is poured into differently shaped containers, or when rods of the same length are placed in different spatial arrangements.

Piagetian tasks have been used widely in research by developmental psychologists, and some have been organized into standardized scales, to be discussed in Chapters 10 and 14 (Goldschmid & Bentler, 1968b; Pinard & Laurendeau, 1964; Uzgiris & Hunt, 1975). Although scores on such scales may be reported in terms of approximate age levels, these scores are secondary to a qualitative description of the child's characteristic behavior. Insofar as these instruments typically provide information about what the child is actually able to do (e.g., climbs stairs without assistance; recognizes identity in quantity of liquid when poured into differently shaped containers), they share important features with the criterion-referenced tests to be discussed in a later section of this chapter.
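The scalogram property that defines such ordinal scales can itself be stated as a simple check: a pass at any stage should never follow a failure at an earlier stage. The sketch below is an added illustration, and the stage labels are hypothetical:

```python
# Sketch: the Guttman-scale (simplex) requirement behind ordinal scales.
# A pass pattern is consistent only if no pass follows an earlier failure.

STAGES = ["follows object", "palmar grasp", "thumb opposition", "pincer grasp"]

def is_ordinal(pass_pattern):
    """True if the pattern (ordered by stage) contains no reversal, i.e.,
    no passed stage that comes after a failed one."""
    seen_failure = False
    for passed in pass_pattern:
        if passed and seen_failure:
            return False        # reversal: later stage passed, earlier failed
        if not passed:
            seen_failure = True
    return True

print(is_ordinal([True, True, False, False]))   # True:  consistent record
print(is_ordinal([True, False, True, False]))   # False: contains a reversal
```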

WITHIN-GROUP NORMS

Nearly all standardized tests now provide some form of within-group norms. With such norms, the individual's performance is evaluated in terms of the performance of the most nearly comparable standardization group, as when comparing a child's raw score with that of children of the same chronological age or in the same school grade. Within-group scores have a uniform and clearly defined quantitative meaning and can be appropriately employed in most types of statistical analysis.

PERCENTILES. Percentile scores are expressed in terms of the percentage of persons in the standardization sample who fall below a given raw score. For example,
if 28 percent of the persons obtain fewer than 15 problems correct on an arithmetic reasoning test, then a raw score of 15 corresponds to the 28th percentile (P₂₈). A percentile indicates the individual's relative position in the standardization sample. Percentiles can also be regarded as ranks in a group of 100, except that in ranking it is customary to start counting at the top, the best person in the group receiving a rank of one. With percentiles, on the other hand, we begin counting at the bottom, so that the lower the percentile, the poorer the individual's standing.

The 50th percentile (P₅₀) corresponds to the median, already discussed as a measure of central tendency. Percentiles above 50 represent above-average performance; those below 50 signify inferior performance. The 25th and 75th percentiles are known as the first and third quartile points (Q₁ and Q₃), because they cut off the lowest and highest quarters of the distribution. Like the median, they provide convenient landmarks for describing a distribution of scores and comparing it with other distributions.

Percentiles should not be confused with the familiar percentage scores. The latter are raw scores, expressed in terms of the percentage of correct items; percentiles are derived scores, expressed in terms of percentage of persons. A raw score lower than any obtained in the standardization sample would have a percentile rank of zero (P₀); a raw score higher than any in the standardization sample would have a percentile rank of 100 (P₁₀₀). These percentiles, however, do not imply a zero raw score or a perfect raw score.

Percentile scores have several advantages. They are easy to compute and can be readily understood, even by relatively untrained persons. Moreover, percentiles are universally applicable. They can be used equally well with adults and children and are suitable for any type of test, whether it measures aptitude or personality variables.

The chief drawback of percentile scores arises from the marked inequality of their units, especially at the extremes of the distribution. If the distribution of raw scores approximates the normal curve, as is true of most test scores, then raw score differences near the median or center of the distribution are exaggerated in the percentile transformation, whereas raw score differences near the ends of the distribution are greatly shrunk. In a normal curve, it will be recalled, cases cluster closely at the center and scatter more widely as the extremes are approached. Consequently, any given percentage of cases near the center covers a shorter distance on the baseline than the same percentage near the ends of the distribution. In Figure 4, this discrepancy in the gaps between percentile ranks (PR) can readily be seen if we compare the distance between a PR of 40 and a PR of 50 with that between a PR of 10 and a PR of 20. Even more striking is the discrepancy between these distances and that between a PR of 10 and a PR of 1. (In a mathematically derived normal curve, the zero percentile is not reached until infinity and hence cannot be shown on the graph.)

Fig. 4. Percentile Ranks in a Normal Distribution.

The same relationship can be seen from the opposite direction if we examine the percentile ranks corresponding to equal σ-distances from the mean of a normal curve. The percentile difference between the mean and +1σ is 34 (84 - 50), whereas that between +1σ and +2σ is only 14 (98 - 84). It is apparent that percentiles show each individual's relative position in the normative sample but not the amount of difference between scores.
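As a concrete illustration of the definition given above, the following sketch computes a percentile rank from a normative sample. The sample values are hypothetical, and ties are handled by the common midpoint convention (counting half of the cases at the given score), which is only one of several usable conventions:

```python
# Sketch: percentile rank = percentage of the standardization sample
# falling below a given raw score (ties counted half, by convention).

def percentile_rank(raw_score, sample):
    below = sum(1 for s in sample if s < raw_score)
    ties = sum(1 for s in sample if s == raw_score)
    return 100 * (below + 0.5 * ties) / len(sample)

# Hypothetical normative sample of arithmetic-reasoning raw scores.
norm_sample = [9, 11, 12, 13, 13, 14, 15, 15, 16, 17,
               18, 18, 19, 20, 21, 22, 23, 24, 25, 27]
print(percentile_rank(15, norm_sample))   # relative position of a score of 15
```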

If plotted on arithmetic probability paper, however, percentile scores can provide a correct visual picture of the differences between scores. Arithmetic probability paper is a cross-section paper in which the vertical lines are uniformly spaced, whereas the horizontal lines are spaced in the same way as the percentile points in a normal distribution (as in Figure 4), or vice versa (as in Figure 5). Such a normal percentile chart can be used to plot the scores of different persons on the same test or the scores of the same person on different tests. In either case, the actual inter-score difference will be correctly represented. Many aptitude and achievement batteries now utilize this technique in their score profiles, which show the individual's performance in each test. An example is the Individual Report Form of the Differential Aptitude Tests, reproduced in Figure 13 (Ch. 5).

STANDARD SCORES. Current tests are making increasing use of standard scores, which are the most satisfactory type of derived score from most points of view. Standard scores express the individual's distance from the mean in terms of the standard deviation of the distribution. Standard scores may be obtained by either linear or nonlinear transformations of the original raw scores. When found by a linear transformation, they retain the exact numerical relations of the original raw scores, because they are computed by subtracting a constant from each raw score and then dividing the result by another constant. The relative magnitude of differences between standard scores derived by such a linear transformation corresponds exactly to that between the raw scores. All properties of the original distribution of raw scores are duplicated in the distribution of these standard scores. For this reason, any computations that can be carried out with the original raw scores can also be carried out with linear standard scores, without any distortion of results.

Linearly derived standard scores are often designated simply as "standard scores" or "z scores." To compute a z score, we find the difference between the individual's raw score and the mean of the normative group and then divide this difference by the SD of the normative group. Table 3 shows the computation of z scores for two individuals, one of whom falls 1 SD above the group mean, the other .40 SD below the mean. Any raw score that is exactly equal to the mean is equivalent to a z score of zero. It is apparent that such a procedure will yield derived scores that have a negative sign for all subjects falling below the mean. Moreover, because the total range of most groups extends no farther than about 3 SD's above and below the mean, such standard scores will have to be reported to at least one decimal place in order to provide sufficient differentiation among individuals. Both of the above conditions, viz., the occurrence of negative values and of decimals, tend to produce awkward numbers that are confusing and difficult to use for both computational and reporting purposes. For this reason, some further linear transformation is usually applied, simply to put the scores into a more convenient form. For example, the scores on the Scholastic Aptitude Test (SAT) of the College Entrance Examination Board are standard scores adjusted to a mean of 500 and an SD of 100. Thus a standard score of -1 on this test would be expressed as 400 (500 - 100 = 400). Similarly, a standard score of +1.5 would correspond to 650 (500 + 1.5 × 100 = 650). To convert an original standard score to the new scale, it is simply necessary to multiply the standard score by the desired SD (100) and add it to or subtract it from the desired mean (500). Any other convenient values can be arbitrarily chosen for the new mean and SD. Scores on the separate subtests of the Wechsler Intelligence Scales, for instance, are converted to a distribution with a mean of 10 and an SD of 3. All such measures are examples of linearly transformed standard scores.
Fig. 5. A Normal Percentile Chart. (On this chart, percentiles are spaced so as to correspond to equal distances in a normal distribution; the chart can be used to plot the scores of different persons on the same test or the scores of the same person on different tests. Compare the score distance between John and Mary with that between Ellen and Edgar: within both pairs, the percentile difference is 5 points. Similarly, Jane and Dick differ by 10 percentile points, as do Bill and Debby.)

TABLE 3
Computation of Standard Scores

z = (X - M) / SD          (group mean M = 60, SD = 5)

John's score:    X₁ = 65        z₁ = (65 - 60)/5 = +1.00
Bill's score:    X₂ = 58        z₂ = (58 - 60)/5 = -0.40
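The two linear transformations described in this section, the z score and its conversion to a more convenient scale, can be sketched as follows. This is an added illustration, not part of the original text, using the Table 3 values and the CEEB convention of mean 500 and SD 100:

```python
# Sketch: z scores and their linear rescaling to a chosen mean and SD.

def z_score(raw, mean, sd):
    """Distance of a raw score from the group mean, in SD units."""
    return (raw - mean) / sd

def rescale(z, new_mean, new_sd):
    """Linearly convert a z score to a scale with the chosen mean and SD."""
    return new_mean + z * new_sd

# Table 3 values: group mean 60, SD 5; rescaled to the CEEB scale.
for raw in (65, 58):
    z = z_score(raw, mean=60, sd=5)
    print(raw, z, rescale(z, 500, 100))   # 65 -> +1.0 -> 600; 58 -> -0.4 -> 460
```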

The linearly derived standard scores discussed in the preceding section will be comparable only when found from distributions that have approximately the same form. Under such conditions, a score cutting off, say, the top 16 percent of one distribution cuts off the same percentage in the other. If, however, one distribution is markedly skewed and the other normal, a z score of +1.00 might exceed only 50 percent of the cases in the one group but would exceed 84 percent in the other. In order to achieve comparability of scores from dissimilarly shaped distributions, nonlinear transformations may be employed to fit the scores to any specified type of distribution curve. Normalized standard scores are standard scores expressed in terms of a distribution that has been transformed to fit a normal curve. Such scores can be computed by reference to tables giving the percentage of cases falling at different SD distances from the mean of a normal curve. First, the percentage of persons in the standardization sample falling at or above each raw score is found. This percentage is then located in the normal curve frequency table, and the corresponding normalized standard score is obtained.

Normalized standard scores are expressed in the same form as linearly derived standard scores, i.e., with a mean of zero and an SD of 1. Thus, a normalized score of zero indicates that the individual falls at the mean of a normal curve, excelling 50 percent of the group. A score of -1 means that he surpasses approximately 16 percent of the group; a score of +1, that he surpasses 84 percent. These percentages correspond to a distance of 1 SD below and 1 SD above the mean of a normal curve, respectively, as can be seen by reference to the bottom line of Figure 4. A given normalized standard score thus signifies that the individual occupies the same position in relation to any group whose scores have been so transformed.

In order to avoid negative values and decimals, normalized standard scores can be put into any convenient form. If the normalized standard score is multiplied by 10 and added to or subtracted from 50, it is converted into a T score, a type of score first proposed by McCall (1922). On this scale, a score of 50 corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth. Another well-known transformation is represented by the stanine scale, developed by the United States Air Force during World War II. This scale provides a single-digit system of scores with a mean of 5 and an SD of approximately 2.³ The name stanine (a contraction of "standard nine") is based on the fact that the scores run from 1 to 9. The restriction of scores to single-digit numbers has certain computational advantages, for each score requires only a single column on computer punched cards. Other variants are the C scale (Guilford & Fruchter, 1973), consisting of 11 units, and the 10-unit sten scale, with 5 units above and 5 below the mean (Canfield, 1951).

Although under certain circumstances another type of distribution may be more appropriate, the normal curve is usually employed for such transformations. One of the chief reasons for this choice is that most raw score distributions approximate the normal curve more closely than they do any other type of curve. Moreover, physical measures such as height and weight, which use equal-unit scales derived through physical operations, generally yield normal distributions. Partly for this reason, and partly as a result of other theoretical considerations, it has frequently been argued that, if an equal-unit scale could be developed for psychological measurement similar to the equal-unit scales of physical measurement, it too would yield normal distributions; this, however, is a debatable point that involves certain questionable assumptions. Another important advantage of the normal curve is that it has many useful mathematical properties, thus being easier to handle quantitatively.

³ Kaiser (1958) proposed a modification of the stanine scale that involves slight changes in the percentages and yields an SD of exactly 2.
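The normalization procedure outlined above can be sketched in a few lines, given a function for the normal curve. The example below is an added illustration: the sample scores are those of Table 2, and the midpoint percentile convention is assumed (it keeps the proportion strictly between 0 and 1 for any score actually occurring in the sample):

```python
# Sketch: normalized standard scores via the inverse normal curve,
# rescaled to T scores (mean 50, SD 10).
from statistics import NormalDist

def t_score(raw, sample):
    """Normalized T score of `raw` within `sample` (midpoint convention)."""
    below = sum(1 for s in sample if s < raw)
    ties = sum(1 for s in sample if s == raw)
    proportion = (below + 0.5 * ties) / len(sample)
    z = NormalDist().inv_cdf(proportion)   # normalized standard score
    return 50 + 10 * z                     # T = 50 + 10z

sample = [32, 33, 37, 38, 40, 41, 41, 44, 46, 48]   # the Table 2 scores
print(round(t_score(44, sample), 1))                # about 56.7
```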
Raw scores can readily be converted to stanines by arranging the original scores in order of size and then assigning stanines in accordance with the normal curve percentages reproduced in Table 4.

TABLE 4
Normal Curve Percentages for Use in Stanine Conversion

Percentage    4    7   12   17   20   17   12    7    4
Stanine       1    2    3    4    5    6    7    8    9

For example, if the group consists of exactly 100 persons, the 4 lowest-scoring persons receive a stanine score of 1, the next 7 a score of 2, the next 12 a score of 3, and so on. When the group contains more or fewer than 100 cases, the number corresponding to each designated percentage is first computed, and these numbers of cases are then given the appropriate stanines. Thus, out of 200 cases, 8 would be assigned a stanine of 1 (4 percent of 200 = 8); with 150 cases, 6 would receive a stanine of 1 (4 percent of 150 = 6). For any group containing from 10 to 100 cases, Bartlett and Edgerton (1966) have prepared a table whereby ranks can be directly converted to stanines. Because of their practical as well as theoretical advantages, stanines are being used increasingly, especially with aptitude and achievement tests.
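The conversion rule of Table 4 lends itself to a short sketch. The following is an added illustration (the 200 cases are randomly generated for demonstration only); it assigns stanines by rank, filling each stanine with the number of cases implied by the percentages:

```python
# Sketch: stanine assignment by rank, per the Table 4 percentages.
import random

STANINE_PERCENTS = [4, 7, 12, 17, 20, 17, 12, 7, 4]   # stanines 1 through 9

def assign_stanines(scores):
    """Assign a stanine to every score: ranks are filled from the bottom,
    with the count in each stanine fixed by the normal curve percentages."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices, lowest first
    stanines = [None] * n
    rank, cumulative = 0, 0.0
    for stanine, pct in enumerate(STANINE_PERCENTS, start=1):
        cumulative += pct * n / 100.0
        while rank < n and rank < round(cumulative):
            stanines[order[rank]] = stanine
            rank += 1
    return stanines

random.seed(0)
cases = [random.gauss(50, 10) for _ in range(200)]    # 200 hypothetical scores
print(assign_stanines(cases).count(1))   # 8, i.e., 4 percent of 200 cases
```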

Although normalized standard scores are the most satisfactory type of derived score for the majority of purposes, there are nevertheless certain technical objections to normalizing all distributions routinely. Such a transformation should be carried out only when the sample is large and representative and when there is reason to believe that the deviation from normality results from defects in the test rather than from characteristics of the sample or from other factors affecting the behavior under consideration. It should also be noted that when the original distribution of raw scores approximates normality, the linearly derived standard scores and the normalized standard scores will be nearly identical. Although the methods of deriving the two types of scores are quite different, the resulting scores are very similar under such conditions; the process of normalizing a distribution that is already virtually normal produces little or no change. Whenever feasible, it is generally more desirable to obtain a normal distribution of raw scores by proper adjustment of the difficulty level of test items rather than by subsequently normalizing a markedly nonnormal distribution.

DEVIATION IQ. In an effort to convert MA scores into a uniform index of the individual's relative status, the ratio IQ (Intelligence Quotient) was introduced in early intelligence tests. The IQ was simply the ratio of mental age to chronological age, multiplied by 100 to eliminate decimals (IQ = 100 × MA/CA). Obviously, if a child's MA equals his CA, his IQ will be exactly 100. An IQ of 100 thus represents normal or average performance; IQ's below 100 indicate retardation, and those above 100, acceleration.

The apparent logical simplicity of the traditional ratio IQ, however, proved deceptive. A major technical difficulty is that, unless the SD of the IQ distribution remains approximately constant with age, IQ's will not be comparable at different age levels. In actual practice, it proved difficult to construct tests that met the psychometric requirements for comparability of ratio IQ's throughout their age range. Chiefly for this reason, the ratio IQ has been largely replaced by the so-called deviation IQ, which is actually another variant of the familiar standard score. The deviation IQ is a standard score with a mean of 100 and an SD that approximates the SD of the Stanford-Binet IQ distribution. Although the SD of the Stanford-Binet ratio IQ (last used in the 1937 edition) was not exactly constant at all ages, it fluctuated around a median value slightly greater than 16. Hence, if an SD close to 16 is chosen in reporting standard scores on a newly developed test, the resulting scores can be interpreted in the same way as Stanford-Binet ratio IQ's. Since Stanford-Binet IQ's have been in use for many years, testers and clinicians have become accustomed to interpreting and classifying test performance in terms of such IQ levels; they have learned what to expect from individuals with IQ's of 40, 70, 90, 130, and so forth. There are therefore certain practical advantages in the use of a derived scale that corresponds to the familiar distribution of Stanford-Binet IQ's. Such a correspondence of score units can be achieved by the selection of numerical values for the mean and SD that agree closely with those in the Stanford-Binet distribution. The justification lies in the general familiarity of the term "IQ" and in the fact that such scores can be interpreted as IQ's provided that their SD is approximately equal to that of previously known IQ's. It should be added that the use of the term "IQ" to designate such standard scores may seem somewhat misleading: they are not derived by the methods employed in finding traditional ratio IQ's, and they are not ratios of mental ages and chronological ages. The deviation IQ does, however, achieve what the ratio IQ promised: an IQ of 115 at age 10, for example, indicates the same degree of superiority as an IQ of 115 at age 12, since both fall at a distance of 1 SD above the means of their respective age distributions.
Among the first tests to express scores in terms of deviation IQ's were the Wechsler Intelligence Scales; in these tests, the mean is 100 and the SD 15. Deviation IQ's are also used in a number of current group tests of intelligence and in the latest revision of the Stanford-Binet itself.

With the increasing use of deviation IQ's, it is important to remember that deviation IQ's from different tests are comparable only when they employ the same or closely similar values for the SD. This value should always be reported in the manual and carefully noted by the test user. If a test maker chooses a different value for the SD in making up his deviation IQ scale, the meaning of any given IQ on his test will be quite different from its meaning on other tests. These discrepancies are illustrated in Table 5, which shows the percentage of cases in normal distributions with SD's from 12 to 18 who would obtain IQ's at different levels. An IQ of 70 has been used traditionally as a cutoff point in identifying mental retardation. Table 5 shows that an IQ of 70 cuts off the lowest 3.1 percent of the cases when the SD is 16 (as in the Stanford-Binet), but it may cut off as few as 0.7 percent (SD = 12) or as many as 5.1 percent (SD = 18). These SD values have actually been employed in the IQ scales of published tests. The same discrepancies, of course, apply to IQ's of 130 and above, which might be used in selecting children for special programs for the intellectually gifted; and the IQ range between 90 and 110, generally described as normal, may include as few as 42 percent or as many as 59.6 percent of the population, depending on the test chosen. To be sure, test publishers are making efforts to adopt the uniform SD of 16 in new tests and in new editions of earlier tests. There are still enough variations among currently available tests, however, to make the checking of the SD imperative.

TABLE 5
Percentage of Cases at Each IQ Interval in Normal Distributions with a Mean of 100 and Different Standard Deviations
(Courtesy Test Department, Harcourt Brace Jovanovich, Inc.)

[The table gives, for SD's of 12, 14, 16, and 18, the percentage of cases at each IQ interval: 130 and above, 120-129, 110-119, 100-109, 90-99, 80-89, 70-79, and below 70, each column totaling 100.0 percent. Only scattered entries are legible in this copy.]
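The contrast between the ratio IQ and the deviation IQ can be summarized in a brief sketch. This is an added illustration; the group mean and SD in the example are hypothetical:

```python
# Sketch: ratio IQ versus deviation IQ, as discussed above.

def ratio_iq(mental_age, chronological_age):
    """Traditional ratio IQ = 100 * MA / CA."""
    return 100 * mental_age / chronological_age

def deviation_iq(raw, group_mean, group_sd, iq_sd=16):
    """Standard score rescaled to a mean of 100 and the chosen IQ SD."""
    z = (raw - group_mean) / group_sd
    return 100 + iq_sd * z

print(ratio_iq(10, 8))                     # 125.0
print(deviation_iq(56, 50, 6))             # 116.0, one SD above the mean
print(deviation_iq(56, 50, 6, iq_sd=15))   # 115.0, the Wechsler convention
```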

INTERRELATIONSHIPS OF WITHIN-GROUP SCORES. At this stage in our discussion of derived scores, the reader may have become aware of a rapprochement among the various types of scores. Percentiles have gradually been taking on at least a graphic resemblance to normalized standard scores. Linear standard scores are indistinguishable from normalized standard scores if the original distribution of raw scores closely approximates the normal curve. Finally, standard scores have become IQ's, and vice versa. Examination of the meaning of a ratio IQ on such a test as the Stanford-Binet will show that these IQ's can themselves be interpreted as standard scores. If we know that the distribution of Stanford-Binet ratio IQ's had a mean of 100 and an SD of approximately 16, we can conclude that an IQ of 116 falls at a distance of 1 SD above the mean and represents a standard score of +1.00. Similarly, an IQ of 132 corresponds to a standard score of +2.00, an IQ of 76 to a standard score of -1.50, and so forth. Moreover, a Stanford-Binet ratio IQ of 116 corresponds to a percentile rank of approximately 84, because in a normal curve 84 percent of the cases fall below +1.00 SD (Figure 4).

In Figure 6 are summarized the relationships that exist in a normal distribution among the types of scores so far discussed in this chapter. These include z scores, College Entrance Examination Board (CEEB) scores, Wechsler deviation IQ's (SD = 15), T scores, stanines, and percentiles. Ratio IQ's on any test will coincide with the deviation IQ scale shown if they are normally distributed and have an SD of 15.

Fig. 6. Relationships among Different Types of Test Scores in a Normal Distribution. (The figure aligns, under a single normal curve, the baseline in σ units from -4 to +4 with the corresponding z scores, CEEB scores from 200 to 800, Wechsler deviation IQ's from 55 to 145, T scores, stanines 1 through 9 with their percentages, and percentile ranks.)
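The alignments shown in Figure 6 can also be expressed computationally. The following added sketch converts a single z score into the other scales of the figure; the stanine boundaries used are the cumulative percentages implied by Table 4:

```python
# Sketch: one z score expressed on the scales aligned in Figure 6.
from statistics import NormalDist
import bisect

STANINE_CUM = [4, 11, 23, 40, 60, 77, 89, 96]   # cumulative Table 4 percentages

def score_profile(z):
    pr = 100 * NormalDist().cdf(z)               # percentile rank
    return {
        "z": z,
        "percentile": round(pr, 1),
        "T": round(50 + 10 * z),
        "CEEB": round(500 + 100 * z),
        "deviation IQ (SD 15)": round(100 + 15 * z),
        "stanine": 1 + bisect.bisect_right(STANINE_CUM, pr),
    }

print(score_profile(+1.0))   # PR ~84, T 60, CEEB 600, IQ 115, stanine 7
```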

First. then an IQ of 120 corresponds to '1 SD. each of these scores can be readily translated into .Principles of Psychological Testing ally distributed IQ could be added to the chart. "What tests did he take on these three occasions?" The apparent decline may reflect no more than the differences among the tests. then an individual who received an IQ of 112 on the first test is most likely to receive an IQ of 118 on the secon~.'omposition. is restricted to the particular normative population from which it was derived. In that case..-ertain advantages they offer 'th regard to test construction and statistical treatment of data . If the school records show that Bill Jones re. he would have obtained these scores even if the three tests had been administered within a week of each other. group froin which the sample 1Sdrawn. for instance. ~ve migh~ test a carefully chosen sample of 500 10-year-oJd boys attendmg PUb~IC schools in several American cities. If a schoolchild's cumulative record shows IQ's of 118. Still another example involves longitudinal comparisl?. The test user should never lose sight of the way in which norms are established.lr~de t(t'Qbtain a representative cross sectiol\Hlf. ceived an IQ of 94 and Tom Brown an IQ of 110. are more likely to be overlooked. ~ost pes of within-group derived scores. but similarly constituted.~15 consti~tmg the~i\r.. !hird. considerable attention should be. Test ~corescannot be properly interpreted in the abstract. tests may differ in content despite their similar labels. one of these tests may include only v~rba] content. If the SD is 20. howeyer. Another. In st~tistjca] terminology.. and ease of developing nonns. given to the standardization sample. however expressed. should always be accompanied by the name of the test on which it was obtained. the examiner might erroneously conclude that the individual is much more able along verbal than along spatial lines.ch was given in his respective school. an IQ of 80 to -1 SD.. the exact form in which scores are reported is dictated gelyby convenience. Similarly.•same population should not yIeld nonns that diverge appreciably frorp tfl. the scale units may not be comparable. The positions of these two students might have been reversed by exchanging the particular tests that eq.or penn~ne~t. numerical. For example. Standscores in any form (including the deviation IQ) have generally placed other types of scores because of c. An IQ. they must be ree ferred to particular tests. 115. and 101 at the fourth. In conclusion.ation sa'!!Ples used in establIshmg nonns for different tests may vary.!U. provided we know 'SD. if we wish to establish nonns of test performance for the population of 10-year-old. Lack of comparability of either test content or scale units can usually be detected by reference to the test itself or to the test manual.• Any norm. ethnic (. univer. The sample would be checked w1th reference to geographical distribution. Although common]y descnbed by the same blanket term. when the reverse may actually be the case. another may tap predominantly spatial aptitudes. and other relevant characteristics to ensure that it was truly representative of the defined population. Let us s~ppose that a student has been given a verbal comprehension test and a spatial aptitude test to determine his relative standing in the two fields. ObViously. however. while the spatial tes~ was standardized on a selected group of boys attending elective shop courses. In choosing such a sample·. 
a Norms and the Interpretation of Test Scores 89 ISTERTEST COMPARISONS. . It is. Differences in the respective normative samples. Th~ latter des1gn~tes the larger. the first question to ask before interpreting these changes is.apparent that the sample on wh1ch the norms are based should be large enough to provide stable values. the same indi~idu~l will appear to have performed better when compared with an mfenor group than when compared with a superior group. fifth.~ardization sample. and so on. Second.. socioeconomic level.. urban. They JIle~ely represent the test performance of the subi. ny of the others. an individual's relative standing in di~erent functions may be grossly misrepresented through lack of comparability of test norms.ose obtained. a distinction is made between sample and populatIOn. familiarity. .. the composition of the s~dardi. There are three principal reasons to account for systematic variations among the scores obtained by the same individual on different tests. In the development and application of test norms. If the verbal abilitv test was standardized on a random sample of high school students. carefully derived and properly interpreted. public schoo] boys. When certain statistical conditions are met. such IQ's cannot be accepted at face value without further information. and sixth grades. THE NORMATIVE SAMPLE. Th: former refers to the group of individuals actually teste (i. and still another may cover verbal. or allY other score.. similarly chosen sample of th•. af1 eff?rt IS usual. Such differences probably account for many otherwise unexplained discrepancies in test results.it~st is designed. if IQ's onone test have an SD of 12 and IQ's on another have an SD of 18. So-called intelligence tests rrovide many illustrations of this confusion.the populatIon for which th~. As explained earlier in this chapter. are fundamentally s1m1lar _. and spatia] content in about equal proportions.ns of a single individual's test performance over time. Psychological test norms are in no sense absolute.
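The second of these sources of noncomparability, unequal scale units, can be removed arithmetically by passing through z scores. A minimal sketch, assuming normally distributed IQ's with the SD values just cited:

    def convert_iq(iq, sd_from, sd_to, mean=100):
        # Re-express an IQ from a scale with one SD on a scale with
        # another SD, holding the z score constant.
        z = (iq - mean) / sd_from
        return mean + sd_to * z

    print(convert_iq(112, sd_from=12, sd_to=18))   # 118.0, as in the text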

THE NORMATIVE SAMPLE. Any norm, however expressed, is restricted to the particular normative population from which it was derived. Psychological test norms are in no sense absolute, universal, or permanent; they merely represent the test performance of the subjects constituting the standardization sample. In statistical terminology, a distinction is made between sample and population: the former refers to the group of individuals actually tested, while the latter designates the larger, but similarly constituted, group from which the sample is drawn. For example, if we wish to establish norms of test performance for the population of 10-year-old, urban, public school boys, we might test a carefully chosen sample of 500 10-year-old boys attending public schools in several American cities. The sample would be checked with reference to geographical distribution, socioeconomic level, ethnic composition, and other relevant characteristics to ensure that it was truly representative of the defined population. The test user should never lose sight of the way in which norms are established.

In the development and application of test norms, considerable attention should be given to the standardization sample. It is apparent that the sample on which the norms are based should be large enough to provide stable values; another similarly chosen sample of the same population should not yield norms that diverge appreciably from those obtained. Norms with a large sampling error would obviously be of little value in the interpretation of test scores.

Equally important is the requirement that the sample be representative of the population under consideration. Subtle selective factors that might make the sample unrepresentative should be carefully investigated. A number of such selective factors are illustrated in institutional samples. Because such samples are usually large and readily available for testing purposes, they offer an alluring field for the accumulation of normative data. The special limitations of these samples, however, should be carefully analyzed. Testing subjects in school, for example, will yield an increasingly superior selection of cases in the successive grades, owing to the progressive dropping out of the less able pupils. Nor does such elimination affect different subgroups equally: the rate of selective elimination from school is greater for boys than for girls, and greater in lower than in higher socioeconomic levels. Selective factors likewise operate in other institutional samples, such as prisoners, patients in mental hospitals, or institutionalized mental retardates. Because of the many special factors that determine institutionalization itself, such groups are not representative of the entire population of criminals, psychotics, or mental retardates. For example, mental retardates with physical handicaps are more likely to be institutionalized than are the physically fit, and the relative proportion of severely retarded persons will be much greater in institutional samples than in the total population.

Closely related to the question of representativeness of sample is the need for defining the specific population to which the norms apply. One way of ensuring that a sample is representative is to restrict the population to fit the specifications of the available sample; if the population is defined to include only 14-year-old schoolchildren rather than all 14-year-old children, then a school sample would be representative. Ideally, the desired population should be defined in advance in terms of the objectives of the test, and a suitable sample should then be assembled. Practical obstacles in obtaining subjects, however, may make this goal unattainable. In such a case, it is far better to redefine the population more narrowly than to report norms on an ideal population which is not adequately represented by the standardization sample. In actual practice, very few tests are standardized on such broad populations as is popularly assumed. No test provides norms for the human species! And it is doubtful whether any tests give truly adequate norms for such broadly defined populations as "adult American men," "10-year-old American children," and the like. Owing to the differential operation of selective factors, the samples obtained by different test constructors often tend to be unrepresentative of their alleged populations and biased in different ways. Hence, the resulting norms are not comparable.

NATIONAL ANCHOR NORMS. One solution for the lack of comparability of norms is to use an anchor test to work out equivalency tables for scores on different tests. Such tables are designed to show what score in Test A is equivalent to each score in Test B. This can be done by the equipercentile method, in which scores are considered equivalent when they have equal percentiles in a given group. For example, if the 80th percentile in the same group corresponds to an IQ of 115 on Test A and to an IQ of 120 on Test B, then Test-A-IQ 115 is considered to be equivalent to Test-B-IQ 120. This approach has been followed to a limited extent by some test publishers, who have prepared equivalency tables for a few of their own tests (see, e.g., Lennon, 1966a).

More ambitious proposals have been made from time to time for calibrating each new test against a single anchor test, which has itself been administered to a highly representative, national normative sample (Lennon, 1966b). No single anchor test, of course, could be used in establishing norms for all tests, regardless of content. What is required is a battery of anchor tests, all administered to the same national sample; each new test could then be checked against the most nearly similar anchor test in the battery. The data gathered in Project TALENT (Flanagan et al., 1964) so far come closest to providing such an anchor battery for a high school population. Using a random sample of about 5 percent of the high schools in this country, the investigators administered a two-day battery of specially constructed aptitude, achievement, interest, and temperament tests to approximately 400,000 students in grades 9 through 12. In calibrating another instrument, the general procedure is to administer both the Project TALENT battery and the tests to be calibrated to the same sample. Through correlational analysis, a composite of Project TALENT tests is identified that is most nearly comparable to each test to be normed; by means of the equipercentile method, tables are then prepared giving the corresponding scores on the Project TALENT composite and on the particular test. The Project TALENT battery has been employed in this way to calibrate several test batteries in use by the Navy and Air Force (Dailey, Shaycoft, & Orr, 1962). For several other batteries, data have been gathered to identify the Project TALENT composite corresponding to each test in the battery (Cooley, 1965; Cooley & Miller, 1965); these batteries include the General Aptitude Test Battery of the United States Employment Service, the Differential Aptitude Tests, and the Flanagan Aptitude Classification Tests.

Even with the availability of anchor data such as these, it must be recognized that independently developed tests can never be regarded as completely interchangeable. At best, the use of national anchor norms would appreciably reduce the lack of comparability among tests, but it would not eliminate it.4

4 For an excellent analysis of some of the technical difficulties involved in efforts to achieve score comparability with different tests, see Angoff (1966, 1971a).
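The equipercentile method lends itself to a compact sketch. The function below is an illustrative simplification (it ignores the grouping, smoothing, and interpolation used in practice), and the score values in the usage example are invented.

    def equipercentile_equivalents(scores_a, scores_b, points_a):
        # For each Test A score, find the Test B score that cuts off the
        # same proportion of cases in the common group.
        b = sorted(scores_b)
        table = {}
        for score in points_a:
            pct = sum(s <= score for s in scores_a) / len(scores_a)
            index = min(int(pct * len(b)), len(b) - 1)
            table[score] = b[index]
        return table

    a = [10, 12, 15, 18, 20, 23, 27, 30, 33, 36]
    b = [95, 100, 104, 108, 112, 115, 119, 122, 126, 130]
    print(equipercentile_equivalents(a, b, [20, 30]))   # {20: 115, 30: 126}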

Of particular interest is the Anchor Test Study conducted by the Educational Testing Service under the auspices of the U.S. Office of Education (Jaeger, 1973). This study represents a systematic effort to provide comparable and truly representative national norms for the seven most widely used reading achievement tests for elementary schoolchildren. Through an unusually well-controlled experimental design, over 300,000 fourth-, fifth-, and sixth-grade schoolchildren were examined in 50 states. In the equating phase of the study, each child took the reading comprehension and vocabulary subtests from two of the seven batteries, each battery being paired in turn with every other battery; in order to control for order of administration, all the pairings were duplicated in reverse sequence. Some groups also took parallel forms of the two subtests from the same battery. The anchor test consisted of the reading comprehension and vocabulary subtests of the Metropolitan Achievement Test, for which new norms were established in one phase of the project. From statistical analyses of all these data, score equivalency tables for the seven tests were prepared by the equipercentile method. A manual for interpreting scores is provided for use by school systems and other interested persons (Loret, Seder, Bianchini, & Vale, 1974).

SPECIFIC NORMS. Another approach to the nonequivalence of existing norms, and probably a more realistic one for most tests, is to standardize tests on more narrowly defined populations, so chosen as to suit the specific purposes of each test. In such cases, the limits of the normative population should be clearly reported with the norms. Thus, the norms might be said to apply to "employed clerical workers in large business organizations" or to "first-year engineering students." For many testing purposes, highly specific norms are desirable. Even when representative norms are available for a broadly defined population, it is often helpful to have separately reported subgroup norms. This is true whenever recognizable subgroups yield appreciably different scores on a particular test. The subgroups may be formed with respect to age, grade, type of curriculum, sex, geographical region, urban or rural environment, socioeconomic level, and many other factors. The use to be made of the test determines the type of differentiation that is most relevant, as well as whether general or specific norms are more appropriate.

Mention should also be made of local norms, often developed by the test users themselves within a particular setting. The groups employed in deriving such norms are even more narrowly defined than the subgroups considered above. Thus, an employer may accumulate norms on applicants for a given type of job within his company, a college admissions office may develop norms on its own student population, or a single elementary school may evaluate the performance of individual pupils in terms of its own score distribution. These local norms are more appropriate than broad national norms for many testing purposes, such as the prediction of subsequent job performance or college achievement, the comparison of a child's relative achievement in different subjects, or the measurement of an individual's progress over time.

FIXED REFERENCE GROUP. Although most derived scores are computed in such a way as to provide an immediate normative interpretation of test performance, there are some notable exceptions. One type of nonnormative scale utilizes a fixed reference group in order to ensure comparability and continuity of scores, without providing normative evaluation of performance. With such a scale, normative interpretation requires reference to independently collected norms from a suitable population; local or other specific norms are often used for this purpose.

One of the clearest examples of scaling in terms of a fixed reference group is provided by the score scale of the College Board Scholastic Aptitude Test (Angoff, 1962, 1971b). Between 1926 (when this test was first administered) and 1941, SAT scores were expressed on a normative scale, in terms of the mean and SD of the candidates taking the test at each administration. As the number and variety of College Board member colleges increased and the composition of the candidate population changed, it was concluded that scale continuity should be maintained; otherwise, an individual's score would depend on the characteristics of the group tested during a particular year. An even more urgent reason for scale continuity stemmed from the observation that students taking the SAT at certain times of the year performed more poorly than those taking it at other times, owing to the differential operation of selective factors. After 1941, therefore, all SAT scores were expressed in terms of the mean and SD of the approximately 11,000 candidates who took the test in 1941. These candidates constitute the fixed reference group employed in scaling all subsequent forms of the test. Thus, a score of 500 on any form of the SAT corresponds to the mean of the 1941 sample, a score of 600 falls 1 SD above that mean, and so forth. To permit translation of raw scores on any form of the SAT into these fixed-reference-group scores, a short anchor test (or set of common items) is included in each form. Each new form is thereby linked to one or two earlier forms, which in turn are linked with other forms by a chain of items extending back to the 1941 form.
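Reporting on such a scale is, at bottom, a linear transformation anchored to the reference-group statistics. In the sketch below, the raw-score mean and SD attributed to the reference group are invented for illustration; only the scale mean of 500 and SD of 100 follow the convention described above.

    REF_MEAN, REF_SD = 23.4, 6.1   # hypothetical reference-group raw statistics

    def fixed_group_score(raw):
        # Score = 500 + 100z, where z is computed against the fixed
        # reference group rather than against this year's candidates.
        z = (raw - REF_MEAN) / REF_SD
        return round(500 + 100 * z)

    print(fixed_group_score(23.4))   # 500: the reference-group mean
    print(fixed_group_score(29.5))   # 600: one reference-group SD above the mean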

It will be noted that the principal difference between the fixed-reference-group scales under consideration and the previously discussed scales based on national anchor norms is that the latter require the choice of a single group that is broadly representative and appropriate for normative purposes. Apart from the practical difficulties in obtaining such a group and the need to keep its norms up to date, it is likely that for many testing purposes such broad norms are not required. The nonnormative SAT scores can be interpreted by comparison with any appropriate distribution of scores, such as that of a particular college, a type of college, or a region; and these specific norms are more useful in making college admission decisions than would be annual norms based on the entire candidate population. Moreover, any changes in the candidate population over time can be detected only with a fixed-score scale.

Scales built from a fixed reference group are analogous in one respect to scales employed in physical measurement. Writing about scales of psychological measurement, Angoff (1962, pp. 32-33) observes:

There is hardly a person here who knows the precise original definition of the length of the foot used in the measurement of height or distance, or which it was whose foot was originally agreed upon as the standard; on the other hand, there is no one here who does not know how to evaluate lengths and distances in terms of this unit. Our ignorance of the precise original meaning or derivation of the foot does not lessen its usefulness to us in any way. Its usefulness derives from the fact that it remains the same over time and allows us to familiarize ourselves with it. Precisely the same considerations apply to other units of measurement (the inch, the mile, the degree of Fahrenheit, and so on). It is similarly reasonable to say that the original definition of the scale is or should be of no consequence. What is of consequence is the maintenance of a constant scale (which, in the case of a multiple-form testing program, is achieved by rigorous form-to-form equating) and the provision of supplementary normative data to aid in interpretation and in the formation of specific decisions, data which would be revised from time to time as conditions warrant.

COMPUTER UTILIZATION IN THE INTERPRETATION OF TEST SCORES

Computers have already made a significant impact upon every phase of testing, from test construction to administration, scoring, reporting, and interpretation. The obvious uses of computers, and those developed earliest, represent simply an unprecedented increase in the speed with which traditional data analyses and scoring processes can be carried out. Most current tests, and especially those designed for group administration, are now adapted for computer scoring (Baker, 1971). Several test publishers, as well as independent test-scoring organizations, are equipped to provide such scoring services to test users. Although separate answer sheets are commonly used for this purpose, optical scanning equipment available at some scoring centers permits the reading of responses directly from test booklets.

Far more important, however, are the adoption of new procedures and the exploration of new approaches to psychological testing which would have been impossible without the flexibility, speed, and data-processing capabilities of computers. As Baker (1971, p. 227) succinctly puts it, computer capabilities should serve "to free one's thinking from the constraints of the past." Many innovative possibilities, such as diagnostic scoring and path analysis (recording a student's progress at various stages of learning), have barely been explored. Various testing innovations resulting from computer utilization will be discussed under appropriate topics throughout the book. In the present connection, we shall examine some applications of computers in the interpretation of test scores.

At the simplest level, test scores are usually incorporated in the computer data base, together with other information provided by the student or client. At a somewhat more complex level, certain tests now provide facilities for computer interpretation of test scores. In such cases, the computer program associates prepared verbal statements with particular patterns of test responses. This approach has been pursued with both personality and aptitude tests. With the Minnesota Multiphasic Personality Inventory (MMPI), for example, test users may obtain computer printouts of diagnostic and interpretive statements about the subject's personality tendencies and emotional condition, together with the numerical scores. Similarly, the Differential Aptitude Tests (see Ch. 13) provide a Career Planning Report, which includes a profile of scores on the separate subtests as well as an interpretive computer printout. The latter contains verbal statements that combine the test data with information on interests and goals given by the student on a Career Planning Questionnaire. These statements are typical of what a counselor would say to the student in going over his test results in an individual conference (Super, 1970, 1973).
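The mechanism at this second level can be pictured as a lookup from score patterns to prepared statements. The thresholds and wording below are invented for illustration and are not drawn from the MMPI printout, the DAT report, or any other published system.

    def interpretive_printout(profile):
        # Associate canned verbal statements with patterns of scores;
        # all rules and texts here are hypothetical.
        statements = []
        if profile["verbal"] >= 60 and profile["numerical"] <= 40:
            statements.append("Verbal reasoning is a clear relative "
                              "strength; numerical work may need support.")
        if max(profile.values()) - min(profile.values()) < 10:
            statements.append("The profile is flat; no subtest stands out.")
        return statements

    print(interpretive_printout({"verbal": 65, "numerical": 38, "spatial": 52}))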

Individualized interpretation of test scores at a still more complex level is illustrated by interactive computer systems, in which the individual is in direct contact with the computer by means of response stations and in effect engages in a dialogue with it (Harris, 1973; Holtzman, 1970; Katz, 1974; Super, 1970). Essentially, the computer combines all the available information about the individual with stored data about educational programs and occupations, and it utilizes the relevant facts and relations in answering the individual's questions and aiding him in reaching decisions. This technique has been investigated chiefly with regard to educational and vocational planning and decision making. Examples of such interactive computer systems, to be discussed in Chapter 17, include IBM's Education and Career Exploration System (ECES), which includes a program of self-knowledge, occupational exploration, and educational planning, and the Educational Testing Service's System of Interactive Guidance and Information (SIGI). Preliminary field trials show good acceptance of these systems by high school students and their parents (Harris, 1973).

Test results also represent an integral part of the data utilized in computer-managed instruction (CMI), a somewhat less costly and operationally more feasible variant of computer utilization in which the learner does not interact directly with the computer (Hambleton, 1974). In such systems, the role of the computer is to assist the teacher in individualizing instruction: its principal contribution is to process the rather formidable mass of data accumulated daily regarding the performance of each student, and to utilize these data in prescribing the next instructional step for each. Examples of individualized instructional systems making such use of computers are provided by the University of Pittsburgh's IPI (Individually Prescribed Instruction; see Cooley & Glaser, 1969; Glaser, 1968) and by Project PLAN (Planning for Learning in Accordance with Needs), developed by the American Institutes for Research (Flanagan, 1971). In these programs, each student may be involved in a different activity at any one time, and testing is a regular part of the management of learning. In computer-assisted instruction (CAI), on the other hand, the computer itself presents the instructional material, evaluates the student's responses, and keeps a record of each student's response history. Depending upon his responses, the student may be advanced to new material, directed to further practice at the present level, or routed to a remedial branch; diagnostic analysis of errors may lead to elementary prerequisite material or to an instructional program designed to correct the specific learning difficulties identified in individual cases.

CRITERION-REFERENCED TESTING

NATURE AND USES. An approach to testing that has aroused a surge of activity, particularly in education, is generally designated as "criterion-referenced testing." First proposed by Glaser (1963), this term is still used somewhat loosely, and its definition varies among different writers. Moreover, several alternative terms are in common use, such as content-, domain-, and objective-referenced. These terms are sometimes employed as synonyms for criterion-referenced and sometimes with slightly different connotations. "Criterion-referenced," however, seems to have gained ascendancy, although it is not the most appropriate term.

Typically, criterion-referenced testing uses as its interpretive frame of reference a specified content domain rather than a specified population of persons. In this respect, it has been contrasted with the usual norm-referenced testing, in which an individual's score is interpreted by comparing it with the scores obtained by others on the same test. In criterion-referenced testing, for example, an examinee's test performance may be reported in terms of the specific kinds of arithmetic operations he has mastered, the estimated size of his vocabulary, the difficulty level of reading matter he can comprehend (from comic books to literary classics), or the chances of his achieving a designated performance level on an external criterion (educational or vocational).

Thus far, criterion-referenced testing has found its major applications in several recent innovations in education. Prominent among these are the computer-assisted, computer-managed, and other individualized, self-paced instructional systems just described, among them Project PLAN and IPI. In all these systems, testing is closely integrated with instruction, being introduced before, during, and after completion of each instructional unit to check on prerequisite skills, diagnose possible learning difficulties, and prescribe subsequent instructional procedures. From another angle, criterion-referenced tests are useful in broad surveys of educational accomplishment, such as the National Assessment of Educational Progress (Womer, 1970), and in meeting demands for educational accountability (Gronlund, 1974). From still another angle, testing for the attainment of minimum requirements, as in qualifying for a driver's license or a pilot's license, illustrates criterion-referenced testing. Finally, familiarity with the concepts of criterion-referenced testing can contribute to the improvement of the traditional, informal tests prepared by teachers for classroom use. Gronlund (1973) provides a helpful guide for this purpose, as well as a simple and well-balanced introduction to criterion-referenced testing; a brief but excellent discussion of the chief limitations of criterion-referenced tests is given by Ebel (1972b).

CONTENT MEANING. The major distinguishing feature of criterion-referenced testing (however defined and whether designated by this term or by one of its synonyms) is its interpretation of test performance in terms of content meaning. The focus is clearly on what the person can do and what he knows, not on how he compares with others. A fundamental requirement in constructing this type of test is a clearly defined domain of knowledge or skills to be assessed by the test.
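In code, such a content-referenced report amounts to a mapping from instructional objectives to the examinee's demonstrated performance on the items sampling each objective. The objectives, item assignments, and pass rule below are all invented for illustration.

    # Hypothetical objectives, each paired with the items that sample it.
    OBJECTIVES = {
        "adds two-digit numbers":              [1, 2, 3],
        "multiplies three-digit by two-digit": [4, 5, 6],
        "divides with remainder":              [7, 8, 9],
    }

    def content_report(item_results):
        # item_results maps item number -> True (passed) / False (failed).
        # An objective counts as demonstrated here only if all its items
        # are passed; any such rule is a local decision.
        return {objective: all(item_results[i] for i in items)
                for objective, items in OBJECTIVES.items()}

    results = {1: True, 2: True, 3: True, 4: True, 5: False,
               6: True, 7: False, 8: False, 9: True}
    print(content_report(results))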

If scores on such a test are to have communicable meaning, the content domain to be tested must be widely recognized as important. The selected domain must then be subdivided into small units defined in performance terms. In an educational context, these units correspond to behaviorally defined instructional objectives, such as "multiplies three-digit by two-digit numbers" or "identifies the misspelled word in which the final e is retained when adding -ing." In the programs prepared for individualized instruction, such objectives run to several hundred for a single school subject. After the instructional objectives have been formulated, items are prepared to sample each objective. This procedure is admittedly difficult and time-consuming; but without such careful specification and control of content, the results of criterion-referenced testing could degenerate into an idiosyncratic and uninterpretable jumble. When stated in these general terms, the criterion-referenced approach is equivalent to interpreting test scores in the light of the demonstrated validity of the particular test, rather than in terms of vague underlying entities. Detailed specification of objectives is also characteristic of published criterion-referenced tests for basic skills. Examples of such tests include the Prescriptive Reading Inventory and Prescriptive Mathematics Inventory (California Test Bureau), Diagnosis: An Instructional Aid Series in Reading and in Mathematics (Science Research Associates), and the Skills Monitoring System in Reading and in Study Skills (Harcourt Brace Jovanovich).

When strictly applied, criterion-referenced testing is best adapted for testing basic skills (as in reading and arithmetic) at elementary levels. In these areas, instructional objectives can be arranged in an ordinal hierarchy, the acquisition of more elementary skills being prerequisite to the acquisition of higher-level skills. Essentially, such tests follow the simplex model of a Guttman scale (see Popham & Husek, 1969), as do the Piagetian ordinal scales discussed earlier in this chapter. It is impracticable and probably undesirable, however, to formulate highly specific objectives for advanced levels of knowledge in less highly structured subjects; at these levels, both the content and the sequence of learning are likely to be much more flexible. On the other hand, in its emphasis on content meaning in the interpretation of test scores, criterion-referenced testing may exert a salutary effect on testing in general. The interpretation of intelligence test scores, for example, would benefit from this approach. To describe a child's intelligence test performance in terms of the specific intellectual skills and knowledge it represents might help to counteract the confusions and misconceptions that have become attached to the IQ. Such an interpretation can certainly be combined with norm-referenced scores.

MASTERY TESTING. A second major feature almost always found in criterion-referenced testing is the procedure of testing for mastery. Essentially, this procedure yields an all-or-none score, indicating that the individual has or has not attained the preestablished level of mastery. When basic skills are tested, nearly complete mastery is generally expected (e.g., 80-85% correct items). A three-way distinction may also be employed, comprising mastery, nonmastery, and an intermediate, doubtful, or "review" interval.

Mastery testing is regularly employed in the previously cited programs for individualized instruction. Some educators have argued that, given enough time and suitable instructional methods, nearly everyone can achieve complete mastery of the chosen instructional objectives; individual differences would thus be manifested in learning time rather than in final achievement, as in traditional educational testing (Bloom, 1968; Carroll, 1963, 1970; Gagne, 1965). It follows that in mastery testing, individual differences in performance are of little or no interest. Hence, as generally constructed, criterion-referenced tests minimize individual differences: for example, they include items passed or failed by all or nearly all examinees, although such items are usually excluded from norm-referenced tests. As a result of this reduction in variability, the usual methods for finding reliability and validity are inapplicable to most criterion-referenced tests. Further discussion of these points will be found in Chapters 5, 6, and 8.

Beyond basic skills, mastery testing is inapplicable or insufficient. In more advanced and less structured subjects, achievement is open-ended: the individual may progress almost without limit in such functions as understanding, critical thinking, appreciation, and originality. Moreover, content coverage may proceed in many different directions, depending upon the individual's abilities, interests, and goals, as well as upon local instructional facilities. Under these conditions, complete mastery is unrealistic and unnecessary, and norm-referenced evaluation is generally employed in such cases to assess degree of attainment. Some published tests are so constructed as to permit both norm-referenced and criterion-referenced applications. An example is the 1973 Edition of the Stanford Achievement Test, suitable for the elementary school grades. While providing appropriate norms at each level, this battery meets three important requirements of criterion-referenced tests: specification of detailed instructional objectives, adequate coverage of each objective with appropriate items, and a wide range of item difficulty.
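The mastery decision itself is a simple cutoff rule. The sketch below implements the three-way version; the 80 and 85 percent cutoffs echo the range cited above, but in practice the exact values are a matter of local policy.

    def mastery_decision(n_correct, n_items, review_band=(0.80, 0.85)):
        # All-or-none scoring with an intermediate "review" interval.
        proportion = n_correct / n_items
        if proportion >= review_band[1]:
            return "mastery"
        if proportion >= review_band[0]:
            return "review"
        return "nonmastery"

    print(mastery_decision(18, 20))   # 0.90 -> mastery
    print(mastery_decision(16, 20))   # 0.80 -> review
    print(mastery_decision(12, 20))   # 0.60 -> nonmastery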

It should be noted that criterion-referenced testing is neither as new as its proponents imply nor as clearly divorced from norm-referenced testing as some of its adherents suggest. Evaluating an individual's test performance in absolute terms, such as by letter grades or percentage of correct items, is certainly much older than normative interpretation. More precise attempts to evaluate test performance in terms of content meaning also antedate the introduction of the term "criterion-referenced testing" (Ebel, 1962; see also Anastasi, 1968, pp. 69-70). Other examples may be found in the early product scales for assessing the quality of handwriting, compositions, or drawings by matching the individual's work sample against a set of standard specimens. Moreover, as Ebel (1972b) observes, a normative framework is implicit in all testing, regardless of how scores are expressed. The very choice of content or skills to be measured is influenced by the examiner's knowledge of what can be expected from human organisms at a particular developmental or instructional stage, and such a choice presupposes information about what persons have done in similar situations. Nor does mastery testing eliminate individual differences; by imposing uniform cutoff scores on an ability continuum, it merely conceals them. To describe an individual's level of reading comprehension as "the ability to understand the content of the New York Times" still leaves room for a wide range of individual differences in degree of understanding.

EXPECTANCY TABLES. Test scores may also be interpreted in terms of expected criterion performance, as in a training program or on a job. Strictly speaking, the term "criterion-referenced testing" should refer to this type of performance interpretation, while the other approaches discussed in this section can be more precisely described as content-referenced. This usage of the term "criterion" follows standard psychometric practice, as when a test is said to be validated against a particular criterion (see Ch. 2); it is also the terminology used in the APA test Standards (1974).

An expectancy table gives the probability of different criterion outcomes for persons who obtain each test score. For example, if a student obtains a score of 530 on the CEEB Scholastic Aptitude Test, what are the chances that his freshman grade-point average in a specific college will fall in the A, B, C, D, or F category? This type of information can be obtained by examining the bivariate distribution of predictor scores (SAT) plotted against criterion status (freshman grade-point average). If the number of cases in each cell of such a bivariate distribution is changed to a percentage, the result is an expectancy table, such as the one illustrated in Table 6. The data for this table were obtained from 171 high school boys enrolled in courses in American history. The predictor was the Verbal Reasoning test of the Differential Aptitude Tests, administered early in the course; the criterion was end-of-course grades, with which the test scores were substantially correlated.

TABLE 6. Expectancy Table Showing Relation between DAT Verbal Reasoning Test Scores and Course Grades in American History for 171 Boys in Grade 11. Test scores are divided into four class intervals, 40 and above (46 cases), 30-39 (36 cases), 20-29 (43 cases), and below 20 (46 cases); grades are grouped into the categories below 70, 70-79, 80-89, and 90 and above. (Adapted from Fifth Edition Manual for the Differential Aptitude Tests, Forms S and T. Copyright © 1974 by The Psychological Corporation, New York, N.Y. All rights reserved. Reproduced by permission.)

The first column of Table 6 shows the test scores, divided into four class intervals; the number of students whose scores fall into each interval is given in the second column. The remaining entries in each row indicate the percentage of cases within each test-score interval who received each grade at the end of the course. Thus, of the 46 students with scores of 40 or above on the Verbal Reasoning test, 15 percent received grades of 70-79, 22 percent grades of 80-89, and 63 percent grades of 90 or above. At the other extreme, of the 46 students scoring below 20 on the test, 30 percent received grades below 70, 52 percent grades of 70-79, and 17 percent grades of 80-89. On the basis of this table, it could be concluded, for instance, that the probability of a new student with a score of 40 or above obtaining a grade of 90 or above is 63 out of 100 (i.e., .63). Within the limitations of the available data, an expectancy table thus gives an estimate of the probability that an individual will receive a given criterion rating. In many practical situations, criteria can be dichotomized, as into "success" and "failure" on a job, in a training program, or in some other undertaking. Under these conditions, an expectancy chart can be prepared, showing the probability of success or failure corresponding to each score interval.
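Constructing such a table is mechanical once the bivariate distribution is in hand: tally the criterion categories within each score interval and convert the tallies to row percentages. In the sketch below, the eight (score, grade) pairs are invented; a real table would rest on the full distribution.

    # Invented (score, grade) pairs for illustration only.
    data = [(44, "90+"), (41, "80-89"), (35, "70-79"), (31, "90+"),
            (27, "70-79"), (24, "below 70"), (18, "70-79"), (15, "below 70")]

    intervals = [(40, 99), (30, 39), (20, 29), (0, 19)]
    grades = ["below 70", "70-79", "80-89", "90+"]

    for lo, hi in intervals:
        row = [g for s, g in data if lo <= s <= hi]
        # Percentage of cases in this score interval receiving each grade.
        pct = {g: round(100 * row.count(g) / len(row)) for g in grades}
        print((lo, hi), len(row), pct)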

Figure 7 is an example of such an expectancy chart, based on a pilot selection battery developed by the Air Force. The chart shows the percentage of men scoring within each stanine on the battery who failed to complete primary flight training. It can be seen that 77 percent of the men receiving a stanine of 1 were eliminated in the course of training, while only 4 percent of those at stanine 9 failed to complete the training satisfactorily; the percentage of failures decreases consistently over the successive stanines. On the basis of this expectancy chart, it could be predicted, for example, that approximately 40 percent of pilot cadets who obtain a stanine of 4 will fail and 60 percent will complete primary flight training satisfactorily. Similar statements regarding the probability of success and failure could be made about the men receiving each stanine; thus, an individual with a stanine of 4 has a 60:40, or 3:2, chance of completing primary flight training. Besides providing a criterion-referenced interpretation of test scores, both expectancy tables and expectancy charts give a general idea of the validity of a test in predicting a given criterion.

FIG. 7. Expectancy Chart Showing Relation between Performance on Pilot Selection Battery and Elimination from Primary Flight Training. For each stanine, from 9 down to 1, the chart reports the number of men tested and the percentage eliminated from primary flight training. (From Flanagan, 1947, p. 58.)

CHAPTER 5

Reliability

RELIABILITY refers to the consistency of scores obtained by the same persons when they are reexamined with the same test on different occasions, or with different sets of equivalent items, or under other variable examining conditions. This concept of reliability underlies the computation of the error of measurement of a single score, whereby we can predict the range of fluctuation likely to occur in a single individual's score as a result of irrelevant, chance factors.

The concept of test reliability has been used to cover several aspects of score consistency. In its broadest sense, test reliability indicates the extent to which individual differences in test scores are attributable to "true" differences in the characteristics under consideration and the extent to which they are attributable to chance errors. To put it in more technical terms, measures of test reliability make it possible to estimate what proportion of the total variance of test scores is error variance. The crux of the matter, however, lies in the definition of error variance. Factors that might be considered error variance for one purpose would be classified under true variance for another. For example, if we are interested in measuring fluctuations of mood, then the day-by-day changes in scores on a test of cheerfulness-depression would be relevant to the purpose of the test and would hence be part of the true variance of the scores; if, on the other hand, the test is designed to measure more permanent personality characteristics, the same daily fluctuations would fall under the heading of error variance.

Essentially, any condition that is irrelevant to the purpose of the test represents error variance. Thus, when the examiner tries to maintain uniform testing conditions by controlling the testing environment, instructions, time limits, rapport, and other similar factors, he is reducing error variance and making the test scores more reliable. Despite optimum testing conditions, however, no test is a perfectly reliable instrument. Hence, every test should be accompanied by a statement of its reliability. Such a measure of reliability characterizes the test when administered under standard conditions and given to subjects similar to those constituting the normative sample. The characteristics of this sample should therefore be specified, together with the type of reliability that was measured.
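The partition of variance just described can be expressed compactly in code. With a reliability coefficient r, the proportion 1 - r of the total variance is error variance, and the standard error of measurement (by the standard psychometric formula, the SD times the square root of 1 - r) estimates the likely chance fluctuation in a single score; the values below are illustrative.

    import math

    def error_variance_proportion(reliability):
        # Share of the total score variance attributable to chance errors.
        return 1 - reliability

    def standard_error_of_measurement(sd, reliability):
        return sd * math.sqrt(1 - reliability)

    print(round(error_variance_proportion(0.90), 2))          # 0.1
    print(round(standard_error_of_measurement(15, 0.90), 1))  # 4.7 points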

There could, of course, be as many varieties of test reliability as there are conditions affecting test scores, since any such condition might be irrelevant for a certain purpose and would thus be classified as error variance. The types of reliability computed in actual practice, however, are few. In this chapter, the principal techniques for measuring the reliability of test scores will be examined, together with the sources of error variance identified by each. Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient. Accordingly, the next section will consider some of the basic characteristics of correlation coefficients, in order to clarify their use and interpretation. More technical discussion of correlation, as well as more detailed specifications of computing procedures, can be found in any elementary textbook of educational or psychological statistics, such as Guilford and Fruchter (1973).

MEANING OF CORRELATION. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in Figure 8. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in the diagram indicates the score of one individual in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along the diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each individual occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.

FIG. 8. Bivariate Distribution for a Hypothetical Correlation of +1.00. (Scores on variable 1 are grouped in class intervals, such as 40-49, 50-59, and 60-69, along the horizontal axis; scores on variable 2 are similarly grouped along the vertical axis.)

Figure 9 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other: the best individual in variable 1 is the poorest in variable 2 and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all individuals fall on the diagonal extending from the upper left- to the lower right-hand corner, which runs in the reverse direction from that in Figure 8.

FIG. 9. Bivariate Distribution for a Hypothetical Correlation of -1.00.

A zero correlation indicates complete absence of relationship, such as might occur by chance. If each individual's name were pulled at random out of a hat to determine his position in variable 1, and if the process were repeated for variable 2, a zero or near-zero correlation would result. Under these conditions, it would be impossible to predict an individual's relative standing in variable 2 from a knowledge of his score in variable 1. The top-scoring subject in variable 1 might score high, low, or average in variable 2. Some individuals might by chance score above average in both variables, others below average in both; still others might fall above average in one and below in the other, or above the average in the first and at the average in the second, and so forth. There would be no regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these extremes, having some value higher than zero but lower than 1.00. Correlations between measures of abilities are nearly always positive, although frequently low. When a negative correlation is obtained between two such variables, it usually results from the way in which the scores are expressed. For example, if time scores are correlated with amount scores, a negative correlation will probably result. Thus, if each subject's score on an arithmetic computation test is recorded as the number of seconds required to complete all items, while his score on an arithmetic reasoning test represents the number of problems correctly solved, a negative correlation can be expected. In such a case, the poorest (i.e., slowest) individual will have the numerically highest score on the first test, while the best individual will have the numerically highest score on the second.


/ill I I \ \ . he once for all after the cross-products have been added There are many s ortcuts a . It will be recalled that . Next reading test (Y) T~ are. and the sums of the and reading scores ~~~K:t:::!t~h~ ~and~~d /~viations of the arithmetic dividing each x and y by'ts .11/I11I 1 I \ 1/1 i 0- 1 0- '? 0- '? 0- r. but also the amount n of his deviation above or below the group mean. personsfalling above the average receive positive standard scores.94 = fT. of the ~terr~latIon coeffiCient morequickest.[ ' - ./iIt.. This correlation coefficient takes into a. Rather than 1 correspon mg u to find standard scores. means of the 10 scores are given under each aJthm ti umn.=~= NUru. all./ill \ 11IIJIlt JIlt 1/1 I i tive. Bivariate Distribution for a Hypothetical Correlation of -1. 0 eSdc~le m Chapter 4.at . II \ . clearly l~ '" ~ .9.If.31 10 = 86 86 (10)(4. If the sum of the cross-products is negative.:'Ii'. the correlation will be nega- Arithmetic Pupil Bill Carol I Geoffrey Ann X Reading Y x y -4 +7 +1 -5 -3 -6 +3 -1 +2 +6 0 x:z y' 16 49 1 25 9 36 xI} 41 38 17 28 Bob Jane Ellen Ruth Dick : Mary S M fT.60 4. .))t is Simply the mean of these products. The.~ 14 0 86 = v'24. while thosebelow the average receive negative scores. These squares are used in . products will be positive.~:~~t:~~::i\~hor::uts.. deviations are squareda~n ~~: . since this conversion . hiS s~ores in the arithmetic test (X) and the the res ective c~l e sums an .4 -14 8 40 18 24 3 .. we (yS~o~. depending on the nature of the data.- ~ ~ ~ R Pears~~ I:~. provided that each individual falls on theA. we multiply each individ\i&r" tandard score in variable I by his standard score in variable 2. 48 32 34 36 41 43 47 40 400 40 22 16 18 15 24 20 23 27 210 21 +1 -2 +8 -8 -6 -4 +1 +3 +7 0 0 1 4 64 64 36 16 1 9 49 0 2~4 9 1 4· 36 186 . The meanin that l m. Table 7 shows the computation of a to each child's nam ~1e IC and reading scores of 10 children./ill ~ 040-049 u Vl o i . It will have a high positive val\ie:'W~~n corresponding standard scores are of equal sign and of approximately equal amount in the two variables./iII. TABLE 7 Computation of Pearson Product-Moment Correlation Coefficient . the corresponding cross-products will be negative. Table 7 not the rf. most common is the Pearson ProductMoment Correlation Coefficient.IIII.Reliability 107 /I I ./iII . When subjects above the average in one variable are below the average in the other. wheneach individual's standing is expressed in}erms of standard scores. Thus.ceount ot only the person's position in the group. .40 4.40 I ? "':. ' W1 In actual practice it's d~ot n~cessary to convert each raw scorc to a standard score befo' ~ can be mad . and the fourth column./ill \ 11II11II1 . fo r ./iII. while the best individualwill have the highest score on the second.The thU'? column shows the deviation (x) of the deviatio~ ero~1 thed~nthmetic mean. -= ~186 = v'18. one inferior in both woul~ have two negative standard scores.94) (4)R} = 212. b 1 some prod uc t s are posItive and some negative the correlation . re n mg t e cross-products. x wo co umns. The Pearson correlation coefficje. 'ualwillhave the numerically highest score on the first test. method demonst.:would have two positive standard scores.the Pearson correlation coefficient.00. 'll When e c ose to zero. an individual who is superior in both variables to be corre1al:ed.::g /~ore fr~m the reading mean..'''_~~i~i . = IN -.91=. 10 r. Score on Variable 1 Ic.ame side of the mean on both variables.computmg.~l . now. 
Correlation coefficients may be computed in variom ways.9 c 60-69 - but it illustrates the than other methods Ii > 50-59 o Jlltl/tf .I I i .
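The computation in Table 7 is easily verified by machine. The following short program is an illustrative sketch only (the function and variable names are ours, not those of any statistical package); it uses the population standard deviations, dividing by N, exactly as in the formula above.

```python
from math import sqrt

def pearson_r(x_scores, y_scores):
    """Pearson product-moment correlation by the deviation method
    of Table 7: r = sum(xy) / (N * sigma_x * sigma_y)."""
    n = len(x_scores)
    mean_x = sum(x_scores) / n
    mean_y = sum(y_scores) / n
    dev_x = [x - mean_x for x in x_scores]              # column x
    dev_y = [y - mean_y for y in y_scores]              # column y
    sigma_x = sqrt(sum(d * d for d in dev_x) / n)       # sqrt(244/10) = 4.94
    sigma_y = sqrt(sum(d * d for d in dev_y) / n)       # sqrt(186/10) = 4.31
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))   # 86
    return sum_xy / (n * sigma_x * sigma_y)

arithmetic = [41, 38, 48, 32, 34, 36, 41, 43, 47, 40]   # X scores, Table 7
reading    = [17, 28, 22, 16, 18, 15, 24, 20, 23, 27]   # Y scores, Table 7
print(round(pearson_r(arithmetic, reading), 2))         # .40
```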

STATISTICAL SIGNIFICANCE. The correlation of .40 found in Table 7 indicates a moderate degree of positive relationship between the arithmetic and reading scores. There is some tendency for those children doing well in arithmetic also to perform well on the reading test, and vice versa, although the relation is not close. If we are concerned only with the performance of these 10 children, we can accept this correlation as an adequate description of the degree of relation existing between the two variables in this group. In psychological research, however, we are usually interested in generalizing beyond the particular sample of individuals tested to the larger population which they represent. For example, we might want to know whether arithmetic and reading ability are correlated among American schoolchildren of the same age as those we tested. Obviously, the 10 cases actually examined would constitute a very inadequate sample of such a population. Another comparable sample of the same size might yield a much lower or a much higher correlation.

There are statistical procedures for estimating the probable fluctuation to be expected from sample to sample in the size of correlations, means, standard deviations, and other group measures. The question usually asked about correlations, however, is simply whether the correlation is significantly greater than zero. In other words, if the correlation in the population is zero, could a correlation as high as that obtained in our sample have resulted from sampling error? When we say that a correlation is "significant at the 1 percent (.01) level," we mean that the chances are no greater than one out of 100 that the population correlation is zero. Hence, we conclude that the two variables are truly correlated. Significance levels refer to the risk of error we are willing to take in drawing conclusions from our data. If a correlation is said to be significant at the .05 level, the probability of error is 5 out of 100. Most psychological research applies the .01 or the .05 level, although other significance levels may be employed for special reasons.

The correlation of .40 found in Table 7 fails to reach significance even at the .05 level. As might have been anticipated, it is difficult to establish a general relationship conclusively with a sample of this size. With 10 cases, the smallest correlation significant at the .05 level is .63. Any smaller correlation simply leaves unanswered the question of whether the two variables are correlated in the population from which the sample was drawn. The minimum correlations significant at the .01 and .05 levels for groups of different sizes can be found by consulting tables of the significance of correlations in any statistics textbook. For interpretive purposes in this book, only an understanding of the general concept is required.

Parenthetically, it might be added that significance levels can be interpreted in a similar way when applied to other statistical measures. For example, to say that the difference between two means is significant at the .01 level indicates that we can conclude, with only one chance out of 100 of being wrong, that a difference in the obtained direction would be found if we tested the whole population from which our samples were drawn. For instance, if in the sample tested the boys had obtained a significantly higher mean than the girls on a mechanical comprehension test, we could conclude that the boys would also excel in the total population.

THE RELIABILITY COEFFICIENT. Correlation coefficients have many uses in the analysis of psychological data. The measurement of test reliability represents one application of such coefficients. An example of a reliability coefficient, computed by the Pearson Product-Moment method, is to be found in Figure 10. In this case, the scores of 104 persons on two equivalent forms of a Word Fluency test¹ were correlated. In one form, the subjects were given five minutes to write as many words as they could that began with a given letter. The second form was identical, except that a different letter was employed. The two letters were chosen by the test authors as being approximately equal in difficulty for this purpose.

The correlation between the number of words written in the two forms of this test was found to be .72. This correlation is high and significant at the .01 level. With 104 cases, any correlation of .25 or higher is significant at this level. Nevertheless, the obtained correlation is somewhat lower than is desirable for reliability coefficients, which usually fall in the .80's or .90's. An examination of the scatter diagram in Figure 10 shows a typical bivariate distribution of scores corresponding to a high positive correlation. It will be noted that the tallies cluster close to the diagonal extending from the lower left- to the upper right-hand corner; the trend is definitely in this direction, although there is a certain amount of scatter of individual entries.

In the following section, the use of the correlation coefficient in computing different measures of test reliability will be considered.

¹ One of the subtests of the SRA Tests of Primary Mental Abilities for Ages 11 to 17. The data were obtained in an investigation by Anastasi and Drake (1954).
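The significance values cited above can be checked with the standard t test for a correlation coefficient. The sketch below is illustrative only; the critical t values are copied from a conventional table and should be verified against a statistics textbook.

```python
from math import sqrt

def t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero:
    t = r * sqrt(N - 2) / sqrt(1 - r**2), with N - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def critical_r(t_crit, n):
    """Smallest correlation reaching a given critical t for N cases."""
    df = n - 2
    return t_crit / sqrt(t_crit ** 2 + df)

# Two-tailed critical t values from a standard table
# (verify against your own statistics textbook):
T_05_DF8 = 2.306      # .05 level, df = 8
T_01_DF102 = 2.63     # .01 level, df = 102 (approximate)

print(round(t_statistic(0.40, 10), 2))        # 1.23, well below 2.306
print(round(critical_r(T_05_DF8, 10), 2))     # .63, as cited in the text
print(round(critical_r(T_01_DF102, 104), 2))  # .25, as cited in the text
```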

[Figure 10. A Reliability Coefficient of .72. Scatter diagram of scores on two forms of the Word Fluency test. (Data from Anastasi & Drake, 1954.)]

TYPES OF RELIABILITY

TEST-RETEST RELIABILITY. The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion. The reliability coefficient (r₁₁) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test. The error variance corresponds to the random fluctuations of performance from one test session to the other. These variations may result in part from uncontrolled testing conditions, such as extreme changes in weather, sudden noises and other distractions, or a broken pencil point. To some extent, however, they arise from changes in the condition of the subject himself, as illustrated by illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant nature, and the like. Retest reliability shows the extent to which scores on a test can be generalized over different occasions; the higher the reliability, the less susceptible the scores are to the random daily changes in the condition of the subject or of the testing environment.

When retest reliability is reported in a test manual, the interval over which it was measured should always be specified. Since retest correlations decrease progressively as this interval lengthens, there is not one but an infinite number of retest reliability coefficients for any test. It is also desirable to give some indication of relevant intervening experiences of the subjects on whom reliability was measured, such as educational or job experiences, counseling, psychotherapy, and so forth.

Apart from the desirability of reporting length of interval, what considerations should guide the choice of interval? Illustrations could readily be cited of tests showing high reliability over periods of a few days or weeks, but whose scores reveal an almost complete lack of correspondence when the interval is extended to as long as ten or fifteen years. Many preschool intelligence tests, for example, yield moderately stable measures within the preschool period, but are virtually useless as predictors of late childhood or adult IQ's. In actual practice, however, a simple distinction can usually be made. Short-range, random fluctuations that occur during intervals ranging from a few hours to a few months are generally included under the error variance of the test score. Thus, in checking this type of test reliability, an effort is made to keep the interval short. In testing young children, the period should be even shorter than for older persons, since at early ages progressive developmental changes are discernible over a period of a month or even less. For any type of person, the interval between retests should rarely exceed six months.

Any additional changes in the relative test performance of individuals that occur over longer periods of time are apt to be cumulative and progressive rather than entirely random. Moreover, they are likely to characterize a broader area of behavior than that covered by the test performance itself. Thus, one's general level of scholastic aptitude, mechanical comprehension, or artistic judgment may have altered appreciably over a ten-year period, owing to unusual intervening experiences. The individual's status may have either risen or dropped appreciably in relation to others of his own age, because of circumstances peculiar to his own home, school, or community environment, or for other reasons such as illness or emotional disturbance.

The extent to which such factors can affect an individual's psychological development provides an important problem for investigation. This question, however, should not be confused with that of the reliability of a particular test. When we measure the reliability of the Stanford-Binet, for example, we do not ordinarily correlate retest scores over a period of ten years. To be sure, long-range retests have been conducted with such tests, but the results are generally discussed in terms of the predictability of adult intelligence from childhood performance, rather than in terms of the reliability of a particular test. The concept of reliability is generally restricted to short-range, random changes that characterize the test performance itself rather than the entire behavior domain that is being tested.

however. error vanance under ferent individuals the relat' . I measure for e\'al~at' 'ever. Since both types are important for most Like lest-retest rcliabilit.~:~~~ :ow suppose that a second Items are constructed with I ame purpose. d'ffi aftors In the past experience of difwhat from pcrso~ to pe !VeT]· cu ty of the two lists will vary Some1 rson. '~iquepresents difficulties when applied to most psychological tests.~:~:e many of the items covered the easion.. while A will re therefore be reversed o'n th t a ]. Only tests that are not appreciably affected by. One way of avoiding the difficulties enuntered 1n test-retest reliability is through the use of alternate forms the test. It will be noted that such a reliability efficientis a measure of both temporal stability and consistency of nse to different item samples (or test forms). Everyone a e expenence of taking . The same persons can thus be tested with one form on the stoccasjon and with another.ua A than does the second list. To ticular selection of items? I:sa ~'ff epen? on ~actors speci~c to the parently. ~gainwe must fall back on an analysis of the purposes of the test and 9iJ a thorough understanding of the behavior the test is designed to preBiet.. ' wmg 0 c anee differences in the I • Reliability 113 .Prillciples of PsycllOlogical Testing omchildhood performance.£ ' . random changes that characterize the test performance itself .effqua can~ to cover the same range of diffi. The concept of item sam Iin ' alternate-form reliability bu~ al~ ~. ALTERNATE-FORM RELIABILITY. For the large a .. accompanied by a stateme~' f t~rntc. The tests :h~ ~nstruct~ tests desi~ed to meet the U ('Ontam the same number of 1 elDS. The e represents uctuat'o' f one set of items to another b t H .n~nc: .'if!.. rather than in terms of the reliability of a rticulartest. It is the f . The concept of reliability is generally restricted to shortge. the same pattern of right and wrong responses _likelyto recur through sheer memory. I al e to reVICW.ry in the extcnt of daily fluctuation they exhibit. e wo Ists o' t h selection of items. if the interval between reestsis fairly short. same speci caLet us suppose that a 40't VI ua bS slcore differ on the two tests? -I em voca u ary t t h b a measure of general verbal c . If wish to obtain an over-all estimate of the individual's habitual finger diness. WO 111 IVI ua ~ "true scores") B' will neverth I r overa word knowledge (i. the resulting . in thei~ excel B on th~ second The eIe~ excel A on the first list. we would probably require repeated tests on several days. Moreover. e .workmg independt' h s In accor d ance with the 'fi IOns. y across orms only not .I erences 111 the sco e bt' d b m lVIduals on these two tests 'II t r s 0 ame y the same . lPractice will probably produce varying amounts of improvement in the ~testscores of different individuals. :'l' Although.. parallel forms same specifications. 't . A number of sensory dis(~riminationnd motor tests would fall into this category. It should be noted that different behavior functions may themselves . This familiar what extent do Scores on th. the retest technique is inapropriate.~rm rdl~bIhty should always be ministrations as well as ado . alt . he may have had th . IUS the Ii t I' t .'Jetin lend themselves to the retest technique.d l' . and that the cultv as the first test The d.majorityof psychological tests.esu ting from content sampling.?f ('Ourse be exerof a test should be jnd~endc t{ parallel. correlation shows reliabilit Ifn Immediate succession.r. 
It should be noted that different behavior functions may themselves vary in the extent of daily fluctuation they exhibit. For example, steadiness of delicate finger movements is undoubtedly more susceptible to slight changes in the person's condition than is verbal comprehension. If we wish to obtain an over-all estimate of the individual's habitual finger steadiness, we would probably require repeated tests on several days, whereas a single test session would suffice for verbal comprehension.

Although apparently simple and straightforward, the test-retest technique presents difficulties when applied to most psychological tests. Practice will probably produce varying amounts of improvement in the retest scores of different individuals. Moreover, if the interval between retests is fairly short, the examinees may recall many of their former responses. In other words, the same pattern of right and wrong responses is likely to recur through sheer memory. Thus, the scores on the two administrations of the test are not independently obtained, and the correlation between them will be spuriously high. The nature of the test itself may also change with repetition. This is especially true of problems involving reasoning or ingenuity. Once the subject has grasped the principle involved in the problem, or once he has worked out a solution, he can reproduce the correct response in the future without going through the intervening steps. Only tests that are not appreciably affected by repetition lend themselves to the retest technique. A number of sensory discrimination and motor tests would fall into this category. For the large majority of psychological tests, however, the retest technique is inappropriate.

ALTERNATE-FORM RELIABILITY. One way of avoiding the difficulties encountered in test-retest reliability is through the use of alternate forms of the test. The same persons can thus be tested with one form on the first occasion and with another, comparable form on the second. The correlation between the scores obtained on the two forms represents the reliability coefficient of the test. It will be noted that such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms). This coefficient thus combines two types of reliability. Since both types are important for most testing purposes, however, alternate-form reliability provides a useful measure for evaluating many tests.

The concept of item sampling, or content sampling, underlies not only alternate-form reliability but also other types of reliability to be discussed shortly. It is therefore appropriate to examine it more closely. Everyone has probably had the experience of taking a course examination in which he felt he had a "lucky break" because many of the items covered the very topics he happened to have reviewed most carefully. On another occasion, he may have had the opposite experience, finding an unusually large number of items on areas he had failed to review. This familiar situation illustrates error variance resulting from content sampling. To what extent do scores on a test depend on factors specific to the particular selection of items? If a different investigator, working independently, were to prepare another test in accordance with the same specifications, how much would an individual's score differ on the two tests?

Let us suppose that a 40-item vocabulary test has been constructed as a measure of general verbal comprehension. Now suppose that a second list of 40 different words is assembled for the same purpose, and that the items are constructed with equal care to cover the same range of difficulty as the first test. The differences in the scores obtained by the same individuals on these two tests illustrate the type of error variance under consideration. Owing to fortuitous factors in the past experience of different individuals, the relative difficulty of the two lists will vary somewhat from person to person. Thus, the first list might contain a disproportionately large number of words unfamiliar to individual A; the second list might contain a disproportionately large number of words unfamiliar to individual B. If the two individuals are approximately equal in their overall word knowledge (i.e., in their "true scores"), B will nevertheless excel A on the first list, while A will excel B on the second. The relative standing of these two persons will therefore be reversed on the two lists, owing to chance differences in the selection of items.

Like test-retest reliability, alternate-form reliability should always be accompanied by a statement of the length of the interval between test administrations, as well as a description of relevant intervening experiences. If the two forms are administered in immediate succession, the resulting correlation shows reliability across forms only, not across occasions. The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.


the clitions. since only a single administration of a single form is required. Alterte forms are useful in' follow-up studies or in investigations of the ects of some intervening experimental factor on test performance. if it is decreased determining reliability bv ~heP:ari~~ntrown formula is Widely used in porting reliability in this 'fo p a f m. the correlation between their scores would remain un.since adding a constant amount to each score does not alter the <:orrelationcoefficient. however. as questions rereading test. -' on t e other hand. or examp e. wll Increase 0 I "t. the!'use of alternate forms will reduce but not eliminate such an 'effect. however.ctors varying progresquate for most purposes is to fi d th e es. The useof several alternate forms also provides a means of reducing the possibilityof coaching or cheating. ' owever. n is %. I One precaution to b b d . m~ny test manuals re. as well as to the cu I t' If f Ul). Reliability lIS To find split-half reliabilit tl Ii.. ~e on?llndally an. ' 2 . other techniques for estimating test reliability are often required. procedure that is adeof the test. any item involving the same principle can be readily solved by most subjects once they have worked out the solution to the first. changing the specific content of the items in the second form would not suffice to eliminate this carry-over from the first form.'Jto 100. it should be added that alternate forms are unavailable for many tests..e I:e~ls In such a group to be placed spuriousl inflated' . motivation in taking the test. ue. format. h' an a -even sp It pertains ea mg WIt a smale problem h ferring to a particular mechanical di~ . It is reasonable t . e scores on the odd and even items of difficulty such a dl' . WIth a lar If' arrive at a more adequate and . and I other aspects of the test must likewise be checked for comparability. Under these conditions.t~n~a~~. m er 0 Items In the test Other thmgs being equ I th I . In such a way. e a serve 111 making such dd I' to groups of items d l' .leerror in understanding of the problem th Once the two half-scores have b b' d be correlated by the usual m th een a tame for each person. In such a case. Another related question concerns the degree to which the nature of the test will change with repetition.. l+(n-l)r'u in which t is the estimated ffi' n is the number of times th ~o~.d'b~lt. " . It should be added that the availability of parallel test forms is desirIe for other reasons besides the determination of test reliability. In this case a whole r glam. Under these con- =: ~--.. the more reliable it will be? o expect t Iat. reliability by various split-half procedures. 1965). because only one test session is involved: This type of reliahility coefficient is sometimes called a coefficient of internal consistency. Although much more widely applicable than test-retest reliability. the practice effect represents another source of variance that will tend to reduce the correlation between the two test forms. order to obtain th y.. I S conSIstency m tenns of con. VIsIon Yle s verv ne I· . shoufld be noted. if the behavior functions under consideration are subject to a large practice elfeet. formula always involves do~~in"'~~: tpphed to spht-haIf reliability.ls bas~d on only 50 items. ' sue.I . they may correlation actuallv gives th e °l.' e ec at engt means of the Spearman-Bra f e all Its ~oefficlent can be estimated by I wn ormu a. . Finallv.' n y. Ie 1st problem IS how to split the test ill divided in man ~ most nearly comparable halves. 
ger samp e a behaVIOr. In certain types of ingenuity problems. each score is . Temporal stability of the scores does Ilot enter into such reliability. for example.Principles of Psychological Testing :.:~. owing to extent of previous practice with similar material. Any test can be second half w~urd dl~e~ent wars. if all examinees were to show the same improvement with repetition. It is apparent that split-half reliability provides a measure of consistency with regard to content sampling.~~~~. time limits. If the items we . Thus. For all these reasons.h ' .ethod. Th sse rom 2. n is 4.n. tween two sets of scores each a . ~11 the obtained coefficient. it can be simplified as f~Iows:ength of-the test. If the practice effect is small. " . because of the practical difficulties of constructing comparable forms. . It is much more likely. two scores are obtained for e~c1i person by dividing the test into comparable halves.anged in an approximate order . c~ent. a I' e ~nger a test. gIVen below: 'II SPLIT-HALF RELIABILITY. a II} over hme (see Cureton.. al"temate-form reliability also has certain limitations. illustrative examples. that this 'f hoe re la I It" a onlv a half test F 'I I t e entire test consists of 100 ite . owing to differences in nature and I ems. tent samplmg not its sl b'I't . })ractice fatig b d mu a Ive e ects 0 warming . '. . and other factors. e Slml anty of the half-scores would be might aIf~ct items 'i~l~c. .' ar)' eqUlva ent half-scores. .d the 'items should be expressed in the same form and should cover the metype of content. In both based on the full nu b f ' . that individuals will differ in amount of improvement. . reduction will be negligible."affected. The range and level of difficulty of the items should o be equal. and number of test items is incr:a eS ~ eng~ ened or shortened. t e correlatIon IS computed betest-retest and alternate-fotm r:I. Instructions. . To be sure. From a sin'gle. In the first place. if the d from 60 to 30.ms. the Rrst' half and the no difficulty level of 't e comparable. In most tests. we can . or to a gIven passage in a tact to one or the other h~lf \Vere ~ o~p of ~tems should be assigned inin different halves of the t~st th .----_ nr'lI '1 Lenulhening a test h . ore am and am' tI f sively from the beginning to th~ end ~f at Ie. .:administration of one form of a test it is possible to arrive at a measure 'of. ' consIstent measure The ff t th I h emng or shortening a test will hav .

r₁₁ = 2 r′₁₁ / (1 + r′₁₁)
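Both versions of the formula are easily expressed as functions. The following sketch is illustrative only (the names and sample values are hypothetical):

```python
def spearman_brown(r_obtained, n):
    """General Spearman-Brown formula: estimated reliability of a test
    lengthened (or shortened) n times."""
    return n * r_obtained / (1 + (n - 1) * r_obtained)

def split_half_step_up(r_half):
    """Special case n = 2: whole-test reliability estimated from the
    correlation between two half-tests."""
    return 2 * r_half / (1 + r_half)

print(round(split_half_step_up(0.70), 2))    # a half-test r of .70 steps up to .82
print(round(spearman_brown(0.80, 0.5), 2))   # halving a test with r = .80 gives .67
```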

An alternate method for finding split-half reliability was developed by Rulon (1939). It requires only the variance of the differences between each person's scores on the two half-tests (σ²d) and the variance of total scores (σ²t); these two values are substituted in the following formula, which yields the reliability of the whole test directly:

r₁₁ = 1 - σ²d / σ²t
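In code, the Rulon formula amounts to the following (an illustrative sketch with hypothetical half-test scores; variances are population variances, dividing by N, as elsewhere in this chapter):

```python
def variance(scores):
    # population variance, dividing by N
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / len(scores)

def rulon_reliability(half_1, half_2):
    """Rulon (1939): r = 1 - variance(differences) / variance(totals)."""
    differences = [a - b for a, b in zip(half_1, half_2)]
    totals = [a + b for a, b in zip(half_1, half_2)]
    return 1 - variance(differences) / variance(totals)

# Hypothetical odd- and even-half scores of five examinees:
odd_half  = [10, 14, 18, 22, 26]
even_half = [12, 13, 19, 21, 27]
print(round(rulon_reliability(odd_half, even_half), 2))   # .99
```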

,r , hi of this formula to the definition of . It is interesting to note the relations p 's scores on the two half'. , A d'ff ce between a person . 'd d 'errorvanance. ny I eren 'f these differences, dlvl e ' h r The vanance 0 , . 'tests represents c ance eTTO. , 'es the roportion of error variance 111 by the variance of total scores, gl\ 'b P t d from 1 00 it gives the h' 'ariance IS SU trac e , , he scores. When t IS error \ h' h . I to the reliability coefficient. proportion of "true" variance, w IC IS equa . A fourth method for finding reliability, f . I form is based on the . 1 d" t 'ahon 0 a slllg e , also utiliZing a slIlg e a mmlslII , , the test This interitem conf onses to a Items m . consistencv 0 resp f ariance' (1) content samd by two sources a error v , h ,;:sistenclj is ~n uence . d s lit-half reliability); and (2) etero\1 piing (as III altemat~-form an. p m led. The more-homogeneous the geneitv of the behavlOr domalll sa.P ' For example if one test int ' • h' h tl . lteritem conSIS enc\. , b domain, the Ig er Ie 11 h'1 lo'ther cOllllJrises addition, su _ I . I' l' 'tcms w leal b a hI y " eludes only mu tip Ica IOn I ..'.. the former test will pro I· I' t' and dIVISIOnItems, ' traction, mu tip Ica lOn, h th latter In the latter, more . 't onsistenc\' t an e, h . ' show more mten em c "f better in subtraction t an III ' t t e subJ'ect ma\' per orm 1 ' heterogeneous es, on. "ons' another subject may score re a~, any of the other arithmetIc operatl ly in addition, subtrach d' " 'tems but more poor b tively well on t e IVI510n I , A ore extreme example would e tion and multiplication; and so on. mb I items in contrast to one ' b t . ti I IT of 40 voca u ary, . represented y a tcs consls I/::). I I t'ons 10 arithmetic reasomng, b 1 10 spaha re a I 0, ' containing 10 voca u ar~, ~ the latter test, there might be little or and 10 perceptual speed Item~'dI. 'd r performance on the different no relationship between an III IVI ua s
KUDER·RICHARDSON RELIABILIT1:..

It is apparent that test scores will be less ambiguous when derived from relatively homogeneous tests. Suppose that in the highly heterogeneous, 40-item test cited above, Smith and Jones both obtain a score of 20. Can we conclude that the performances of the two on this test were equal? Not at all. Smith may have correctly completed 10 vocabulary items and 10 perceptual speed items, and none of the arithmetic reasoning and spatial relations items. In contrast, Jones may have received a score of 20 by the successful completion of 5 perceptual speed, 5 spatial relations, 10 arithmetic reasoning, and no vocabulary items. Many other combinations could obviously produce the same total score of 20. This score would have a very different meaning when obtained through such dissimilar combinations of items. In the relatively homogeneous vocabulary test, on the other hand, a score of 20 would probably mean that the subject had succeeded with approximately the first 20 words, if the items were arranged in ascending order of difficulty. He might have failed two or three easier words and correctly responded to two or three more difficult items beyond the 20th, but such individual variations are slight in comparison with those found in a more heterogeneous test.

A highly relevant question in this connection is whether the criterion that the test is trying to predict is itself relatively homogeneous or heterogeneous. Although homogeneous tests are to be preferred because their scores permit fairly unambiguous interpretation, a single homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion. Moreover, in the prediction of a heterogeneous criterion, the heterogeneity of test items would not necessarily represent error variance. Traditional intelligence tests provide a good example of heterogeneous tests designed to predict heterogeneous criteria. In such a case, however, it may be desirable to construct several relatively homogeneous tests, each measuring a different phase of the heterogeneous criterion. Thus, unambiguous interpretation of test scores could be combined with adequate criterion coverage.

The most common procedure for finding interitem consistency is that developed by Kuder and Richardson (1937). As in the split-half methods, interitem consistency is found from a single administration of a single test. Rather than requiring two half-scores, however, such a technique is based on an examination of performance on each item. Of the various formulas derived in the original article, the most widely applicable, commonly known as "Kuder-Richardson formula 20," is the following:³

r₁₁ = (n / (n - 1)) × (σ²t - Σpq) / σ²t

In this formula, r₁₁ is the reliability coefficient of the whole test, n is the number of items in the test, and σt the standard deviation of total scores on the test. The only new term in this formula, Σpq, is found by tabulating the proportion of persons who pass (p) and the proportion who do not pass (q) each item. The product of p and q is computed for each item, and these products are then added for all items, to give Σpq. Since in the process of test construction p is often routinely recorded in order to find the difficulty level of each item, this method of determining reliability involves little additional computation.

³ A simple derivation of this formula can be found in Ebel (1965, pp. 325-327).
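Kuder-Richardson formula 20 can be sketched as follows. The item matrix below is hypothetical, and the function name is ours; each entry is 1 for a pass and 0 for a failure.

```python
def kr20(item_matrix):
    """Kuder-Richardson formula 20.
    item_matrix[person][item] is 1 (pass) or 0 (fail)."""
    n_persons = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in item_matrix) / n_persons  # proportion passing
        sum_pq += p * (1 - p)                               # p times q
    return (n_items / (n_items - 1)) * (var_total - sum_pq) / var_total

# Hypothetical responses of four persons to five items:
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
print(round(kr20(responses), 2))   # about .62 for these data
```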


It can be shown mathematically that the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients resulting from different splittings of a test (Cronbach, 1951).⁴ The ordinary split-half coefficient, on the other hand, is based on a planned split designed to yield equivalent sets of items. Hence, unless the test items are highly homogeneous, the Kuder-Richardson coefficient will be lower than the split-half reliability. An extreme example will serve to highlight the difference. Suppose we construct a 50-item test out of 25 different kinds of items, such that items 1 and 2 are vocabulary items, items 3 and 4 arithmetic reasoning, items 5 and 6 spatial orientation, and so on. The odd and even scores on this test could theoretically agree quite closely, thus yielding a high split-half reliability coefficient. The homogeneity of this test, however, would be very low, since there would be little consistency of performance among the entire set of 50 items. In this example, we would expect the Kuder-Richardson reliability to be much lower than the split-half reliability. It can be seen that the difference between Kuder-Richardson and split-half reliability coefficients may serve as a rough index of the heterogeneity of a test.

The Kuder-Richardson formula is applicable to tests whose items are scored as right or wrong, or according to some other all-or-none system. Some tests, however, may have multiple-scored items. On a personality inventory, for example, the respondent may receive a different numerical score on an item, depending on whether he checks "usually," "sometimes," "rarely," or "never." For such tests, a generalized formula has been derived, known as coefficient alpha (Cronbach, 1951; Novick & Lewis, 1967). In this formula, the value Σpq is replaced by Σσᵢ², the sum of the variances of item scores. The procedure is to find the variance of all individuals' scores for each item and then to add these variances across all items. The complete formula for coefficient alpha is given below (a clear description of the computational layout for finding it can be found in Ebel, 1965, pp. 326-330):

r₁₁ = (n / (n - 1)) × (σ²t - Σσᵢ²) / σ²t
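The corresponding sketch for coefficient alpha differs from the KR-20 computation only in substituting the sum of item-score variances for Σpq. The ratings below are hypothetical.

```python
def variance(scores):
    # population variance, dividing by N
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / len(scores)

def coefficient_alpha(item_matrix):
    """Coefficient alpha: items may take any numerical values
    (for example, 0-3 ratings), not just 0 and 1."""
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    item_variances = [variance([row[i] for row in item_matrix])
                      for i in range(n_items)]
    var_total = variance(totals)
    return (n_items / (n_items - 1)) * (var_total - sum(item_variances)) / var_total

# Hypothetical ratings of five persons on four items scored 0-3:
ratings = [
    [3, 2, 3, 2],
    [2, 2, 2, 1],
    [1, 0, 1, 1],
    [3, 3, 2, 3],
    [0, 1, 0, 0],
]
print(round(coefficient_alpha(ratings), 2))   # .94
```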

SCORER RELIABILITY. It should now be apparent that the different types of reliability vary in the factors they subsume under error variance. In one case, error variance covers temporal fluctuations; in another, it refers to differences between sets of parallel items; and in still another, it includes any interitem inconsistency. On the other hand, the factors excluded from measures of error variance are broadly of two types: (a) those factors whose variance should remain in the scores, since they are part of the true differences under consideration; and (b) those irrelevant factors that can be experimentally controlled. For example, it is not customary to report the error of measurement resulting when a test is administered under distracting conditions or with a longer or shorter time limit than that specified in the manual. Timing errors and serious distractions can be empirically eliminated from the testing situation. Hence, it is not necessary to report special reliability coefficients corresponding to "distraction variance" or "timing variance."

Similarly, most tests provide such highly standardized procedures for administration and scoring that error variance attributable to these factors is negligible. This is particularly true of group tests designed for mass testing and computer scoring. With such instruments, we need only to make certain that the prescribed procedures are carefully followed and adequately checked. With clinical instruments employed in intensive individual examinations, on the other hand, there is evidence of considerable "examiner variance." Through special experimental designs, it is possible to separate this variance from that attributable to temporal fluctuations in the subject's condition or to the use of alternate test forms.

One source of error variance that can be checked quite simply is scorer variance. Certain types of tests, notably tests of creativity and projective tests of personality, leave a good deal to the judgment of the scorer. With such tests, there is as much need for a measure of scorer reliability as there is for the more usual reliability coefficients. Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two scores thus obtained by each examinee are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability. This type of reliability is commonly computed when subjectively scored instruments are employed in research. Test manuals should also report it when appropriate.

⁴ This is strictly true only when the split-half coefficients are found by the Rulon formula, not when they are found by correlation of halves and the Spearman-Brown formula (Novick & Lewis, 1967).

OVERVIEW. The different types of reliability coefficients discussed in this section are summarized in Tables 8 and 9. In Table 8, the operations followed in obtaining each type of reliability are classified with regard to number of test forms and number of testing sessions required. Table 9 shows the sources of variance treated as error variance by each procedure.

TABLE 8
Techniques for Measuring Reliability, in Relation to Test Form and Testing Session

Testing Sessions   Test Forms Required: One           Test Forms Required: Two
Required
One                Split-Half; Kuder-Richardson;      Alternate-Form (Immediate)
                   Scorer
Two                Test-Retest                        Alternate-Form (Delayed)

TABLE 9
Sources of Error Variance in Relation to Reliability Coefficients

Type of Reliability Coefficient            Error Variance
Test-Retest                                Time sampling
Alternate-Form (Immediate)                 Content sampling
Alternate-Form (Delayed)                   Time sampling and Content sampling
Split-Half                                 Content sampling
Kuder-Richardson and Coefficient Alpha     Content sampling and Content heterogeneity
Scorer                                     Interscorer differences

Any reliability coefficient may be interpreted directly in terms of the percentage of score variance attributable to different sources. Thus, a reliability coefficient of .85 signifies that 85 percent of the variance in test scores depends on true variance in the trait measured and 15 percent depends on error variance (as operationally defined by the specific procedure followed). The statistically sophisticated reader may recall that it is the square of a correlation coefficient that represents proportion of common variance. Actually, the proportion of true variance in test scores is the square of the correlation between scores on a single form of the test and true scores, free from chance errors. This correlation, known as the index of reliability,⁵ is equal to the square root of the reliability coefficient (√r₁₁). When the index of reliability is squared, the result is the reliability coefficient (r₁₁), which can therefore be interpreted directly as the percentage of true variance.

Experimental designs that yield more than one type of reliability coefficient for the same group permit the analysis of total score variance into different components. Let us consider the following hypothetical example. Forms A and B of a creativity test have been administered with a two-month interval to 100 sixth-grade children. The resulting alternate-form reliability is .70. From the responses of either form, a split-half reliability coefficient can also be computed.⁶ This coefficient, stepped up by the Spearman-Brown formula, is .80. Finally, a second scorer has rescored a random sample of 50 papers, from which a scorer reliability of .92 is obtained. The three reliability coefficients can now be analyzed to yield the error variances shown in Table 10 and Figure 11.

It will be noted that by subtracting the error variance attributable to content sampling alone (split-half reliability) from the error variance attributable to both content and time sampling (alternate-form reliability), we find that .10 of the variance can be attributed to time sampling alone. Adding the error variances attributable to content sampling (.20), time sampling (.10), and interscorer difference (.08) gives a total error variance of .38, and hence a true variance of .62. These proportions, expressed in the more familiar percentage terms, are shown graphically in Figure 11.

TABLE 10
Analysis of Sources of Error Variance in a Hypothetical Test

From delayed alternate-form reliability:      1 - .70 = .30   (time sampling plus content sampling)
From split-half, Spearman-Brown reliability:  1 - .80 = .20   (content sampling)
Difference:                                   .30 - .20 = .10  (time sampling)
From scorer reliability:                      1 - .92 = .08   (interscorer difference)

Total Measured Error Variance = .20 + .10 + .08 = .38
True Variance = 1 - .38 = .62

⁵ Derivations of the index of reliability, based on two different sets of assumptions, are given in Gulliksen (1950b, Chs. 2 and 3).
⁶ For a better estimate of the coefficient of internal consistency, split-half correlations could be computed for each form and the two coefficients averaged by the appropriate statistical procedures.
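The arithmetic of Table 10 can be written out compactly; the sketch below simply restates the subtractions performed above for the hypothetical creativity test.

```python
# Partition of score variance for the hypothetical creativity test above.
r_alternate_delayed = 0.70   # error: time sampling + content sampling
r_split_half        = 0.80   # error: content sampling
r_scorer            = 0.92   # error: interscorer difference

content_error = 1 - r_split_half                         # .20
time_error = (1 - r_alternate_delayed) - content_error   # .30 - .20 = .10
scorer_error = 1 - r_scorer                              # .08

total_error = content_error + time_error + scorer_error  # .38
true_variance = 1 - total_error                          # .62
print(f"error variance = {total_error:.2f}, true variance = {true_variance:.2f}")
```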

[Figure 11. Percentage Distribution of Score Variance in a Hypothetical Test. Error variance: 20% content sampling, 10% time sampling, 8% interscorer difference; the remaining 62% is true variance, stable over time, consistent over forms, and free from interscorer difference.]

RELIABILITY OF SPEEDED TESTS

Both in test construction and in the interpretation of test scores, an important distinction is that between the measurement of speed and of power. A pure speed test is one in which individual differences depend entirely on speed of performance. Such a test is constructed from items of uniformly low difficulty, all of which are well within the ability level of the persons for whom the test is designed. The time limit is made so short that no one can finish all the items. Under these conditions, each person's score reflects only the speed with which he worked. A pure power test, on the other hand, has a time limit long enough to permit everyone to attempt all items. The difficulty of the items is steeply graded, and the test includes some items too difficult for anyone to solve, so that no one can get a perfect score.

It will be noted that both speed and power tests are designed to prevent the achievement of perfect scores. The reason for such a precaution is that perfect scores are indeterminate, since it is impossible to know how much higher the individual's score would have been if more items, or more difficult items, had been included. To enable each individual to show fully what he is able to accomplish, the test must provide adequate ceiling, either in number of items or in difficulty level. An exception to this rule is found in mastery testing, as illustrated by the criterion-referenced tests discussed in Chapter 4. The purpose of such testing is not to establish the limits of what the individual can do, but to determine whether a preestablished performance level has or has not been reached.

In actual practice, the distinction between speed and power tests is one of degree, most tests depending on both power and speed in varying proportions. Information about these proportions is needed for each test in order not only to understand what the test measures but also to choose the proper procedures for evaluating its reliability. Single-trial reliability coefficients, such as those found by odd-even or Kuder-Richardson techniques, are inapplicable to speeded tests. To the extent

that individual differences in test scores depend on speed of performance, reliability coefficients found by these methods will be spuriously high. An extreme example will help to clarify this point. Let us suppose that a 50-item test depends entirely on speed, so that individual differences in score are based wholly on number of items attempted, rather than on errors. Then, if individual A obtains a score of 44, he will obviously have 22 correct odd items and 22 correct even items. Similarly, individual B, with a score of 34, will have odd and even scores of 17 and 17, respectively. Consequently, except for accidental careless errors on a few items, the correlation between odd and even scores would be perfect, or +1.00. Such a correlation, however, is entirely spurious and provides no information about the reliability of the test.

An examination of the procedures followed in finding both split-half and Kuder-Richardson reliability will show that both are based on the consistency in number of errors made by the examinee. If, now, individual differences in test scores depend, not on errors, but on speed, the measure of reliability must obviously be based on consistency in speed of work. When test performance depends on a combination of speed and power, the single-trial reliability coefficient will fall below 1.00, but it will still be spuriously high. As long as individual differences in test scores are appreciably affected by speed, single-trial reliability coefficients cannot be properly interpreted.

What alternative procedures are available to determine the reliability of significantly speeded tests? If the test-retest technique is applicable, it would be appropriate. Similarly, equivalent-form reliability may be properly employed with speed tests. Split-half techniques may also be used, provided that the split is made in terms of time rather than in terms of items. In other words, the half-scores must be based on separately timed parts of the test. One way of effecting such a split is to administer two equivalent halves of the test with separate time limits. For example, the odd and even items may be separately printed on different pages, and each set of items given with one-half the time limit of the entire test. Such a procedure is tantamount to administering two equivalent forms of the test in immediate succession. Each form, however, is half as long as the test proper, while the subjects' scores are normally based on the whole test. For this reason, either the Spearman-Brown or some other appropriate formula should be used to find the reliability of the whole test.

If it is not feasible to administer the two half-tests separately, an alternative procedure is to divide the total time into quarters, and to find a score for each of the four quarters. This can easily be done by having the examinees mark the item on which they are working whenever the examiner gives a prearranged signal. The number of items correctly completed within the first and fourth quarters can then be combined to
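The quarter-splitting procedure just described can be sketched as follows. The quarter scores are hypothetical, and the final step reuses the Spearman-Brown step-up from the preceding section.

```python
def pearson_r(x, y):
    # Pearson correlation, population formulas as in Table 7
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def speeded_split_half(q1, q2, q3, q4):
    """Half-scores from separately timed quarters: first plus fourth
    quarter against second plus third, correlated, then stepped up
    by the Spearman-Brown formula (n = 2)."""
    half_a = [a + d for a, d in zip(q1, q4)]
    half_b = [b + c for b, c in zip(q2, q3)]
    r_half = pearson_r(half_a, half_b)
    return 2 * r_half / (1 + r_half)

# Hypothetical items completed per quarter by five examinees:
q1 = [12, 10, 15, 9, 13]
q2 = [11, 10, 14, 8, 12]
q3 = [10, 9, 13, 8, 11]
q4 = [9, 8, 12, 7, 10]
print(round(speeded_split_half(q1, q2, q3, q4), 2))   # near 1.0 for these consistent workers
```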

represent one half-score, while those in the second and third quarters can be combined to yield the other half-score. Such a combination of quarters tends to balance out the cumulative effects of practice, fatigue, and other factors. This method is especially satisfactory when the items are not steeply graded in difficulty level.

When is a test appreciably speeded? Under what conditions must the special precautions discussed in this section be observed? Obviously, the mere employment of a time limit does not signify a speed test. If all subjects finish within the given time limit, speed of work plays no part in determining the scores. Percentage of persons who fail to complete the test might be taken as a crude index of speed versus power. Even when no one finishes the test, however, the role of speed may be negligible. For example, if everyone completes exactly 40 items of a 50-item test, individual differences with regard to speed are entirely absent, although no one had time to attempt all the items.

The essential question, of course, is: "To what extent are individual differences in test scores attributable to speed?" In more technical terms, we want to know what proportion of the total variance of test scores is speed variance. This proportion can be estimated roughly by finding the variance of number of items completed by different persons and dividing it by the variance of total test scores (σ²c/σ²t). In the example cited above, in which every individual finishes 40 items, the numerator of this fraction would be zero, since there are no individual differences in number of items completed (σ²c = 0). The entire index would thus equal zero in a pure power test. On the other hand, if the total test variance (σ²t) is attributable to individual differences in speed, the two variances will be equal and the ratio will be 1.00. Several more refined procedures have been developed for determining this proportion, but their detailed consideration falls beyond the scope of this book.⁷
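The crude index σ²c/σ²t can be computed as follows (an illustrative sketch with hypothetical data; values near zero suggest a power test, values near 1.00 a speed test):

```python
def variance(scores):
    # population variance, dividing by N
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / len(scores)

def speed_index(items_completed, total_scores):
    """Rough proportion of score variance attributable to speed:
    variance of items completed divided by variance of total scores."""
    return variance(items_completed) / variance(total_scores)

# Everyone completes exactly 40 items: a pure power test
print(speed_index([40, 40, 40, 40], [31, 36, 28, 39]))            # 0.0

# Completion counts vary almost as much as total scores: heavily speeded
print(round(speed_index([35, 42, 28, 47], [33, 41, 25, 46]), 2))  # about .81
```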

An example of the effect of speed on single-trial reliability coefficients is provided by data collected in an investigation of the first edition of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (Anastasi & Drake, 1954). In this study, the reliability of each test was first determined by the usual odd-even procedure. These coefficients, given in the first row of Table 11, are closely similar to those reported in the test manual. Reliability coefficients were then computed by correlating scores on separately timed halves. These coefficients are shown in the second row of Table 11. Calculation of speed indexes showed that the Verbal Meaning test is primarily a power test, while the Reasoning test is somewhat more dependent on speed. The Space and Number tests proved to be highly speeded.

TABLE 11
Reliability Coefficients of Four of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (1st Edition)
(Data from Anastasi & Drake, 1954)

Reliability Coefficient Found by:   Verbal Meaning   Reasoning   Space   Number
Single-trial odd-even method             .94            .96       .90     .92
Separately timed halves                  .90            .87       .75     .83

It will be noted in Table 11 that, when properly computed, the reliability of the Space test is .75, in contrast to a spuriously high odd-even coefficient of .90. Similarly, the reliability of the Reasoning test drops from .96 to .87, and that of the Number test drops from .92 to .83. The reliability of the relatively unspeeded Verbal Meaning test, on the other hand, shows a negligible difference when computed by the two methods.

7 See, e.g., Cronbach & Warrington (1951), Gulliksen (1950a, 1950b), Guttman (1955), Helmstadter & Ortmeyer (1953).

DEPENDENCE OF RELIABILITY COEFFICIENTS ON THE SAMPLE TESTED

HETEROGENEITY. An important factor influencing the size of a reliability coefficient is the nature of the group on which reliability is measured. In the first place, any correlation coefficient is affected by the range of individual differences in the group. If every member of a group were alike in spelling ability, then the correlation of spelling with any other ability would be zero in that group. It would obviously be impossible, within such a group, to predict an individual's standing in any other ability from a knowledge of his spelling score.

Another, less extreme, example is provided by the correlation between two aptitude tests, such as a verbal comprehension and an arithmetic reasoning test. If these tests were administered to a highly homogeneous sample, such as a group of 300 college sophomores, the correlation between the two would probably be close to zero. There is little relationship, within such a selected sample of college students, between any individual's verbal ability and his numerical reasoning ability. On the other hand, were the tests to be given to a heterogeneous sample of 300 persons, ranging from institutionalized mentally retarded persons to college graduates, a high correlation would undoubtedly be obtained between the two tests. The mentally retarded would obtain poorer scores than the college graduates on both tests, and similar relationships would hold for other subgroups within this highly heterogeneous sample.

Examination of the hypothetical scatter diagram given in Figure 12 will further illustrate the dependence of correlation coefficients on the variability, or extent of individual differences, within the group. This scatter diagram shows a high positive correlation in the entire, heterogeneous group, since the entries are closely clustered about the diagonal extending from the lower left- to the upper right-hand corner. If, now, we consider only the subgroup falling within the small rectangle in the upper right-hand portion of the diagram, it is apparent that the correlation between the two variables is close to zero. Individuals falling within this restricted range in both variables represent a highly homogeneous group, as did the college sophomores mentioned above.

Fig. 12. The Effect of Restricted Range upon a Correlation Coefficient. (Scatter diagram; horizontal axis: Score on Variable 1.)

Like all correlation coefficients, reliability coefficients depend on the variability of the sample within which they are found. Thus, if the reliability coefficient reported in a test manual was determined in a group ranging from fourth-grade children to high school students, it cannot be assumed that the reliability would be equally high within, let us say, an eighth-grade sample. When a test is to be used to discriminate individual
differences within a more homogeneous sample than the standardization group, the reliability coefficient should be redetermined on such a sample. Formulas for estimating the reliability coefficient to be expected when the standard deviation of the group is increased or decreased are available in elementary statistics textbooks. It is preferable, however, to recompute the reliability coefficient empirically on a group comparable to that on which the test is to be used. For tests designed to cover a wide range of age or ability, the test manual should report separate reliability coefficients for relatively homogeneous subgroups within the standardization sample.
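One common textbook estimate of this sort adjusts the reliability coefficient for a change in group variability, on the assumption that the error variance stays constant across groups. The sketch below illustrates that standard formula in Python; the sample values are hypothetical, and, as the text cautions, empirical redetermination is preferable:

```python
def adjusted_reliability(r_old: float, sd_old: float, sd_new: float) -> float:
    """Estimate the reliability expected in a group with standard deviation
    sd_new, given reliability r_old obtained in a group with sd_old.

    Uses the classical range-adjustment relation: the error variance
    sd_old**2 * (1 - r_old) is assumed constant across groups.
    """
    error_variance = sd_old ** 2 * (1 - r_old)
    return 1 - error_variance / sd_new ** 2

# A coefficient of .90 found in a heterogeneous group (SD = 20) would be
# expected to drop in a more homogeneous eighth-grade sample (SD = 10).
print(round(adjusted_reliability(0.90, sd_old=20, sd_new=10), 2))  # 0.60
```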

ABILITY LEVEL. Not only does the reliability coefficient vary with the extent of individual differences in the sample, but it may also vary between groups differing in average ability level. These differences, moreover, cannot usually be predicted or estimated by any statistical formula, but can be discovered only by empirical tryout of the test on groups differing in age or ability level. Such differences in the reliability of a single test may arise from the fact that a slightly different combination of abilities is measured at different difficulty levels of the test. Or they may result from the statistical properties of the scale itself, as in the Stanford-Binet (Pinneau, 1961, Ch. 5). Thus, for different ages and for different IQ levels, the reliability coefficient of the Stanford-Binet varies from .83 to .98. In other tests, reliability may be relatively low for the younger and less able groups, since their scores are unduly influenced by guessing. Under such circumstances, the particular test should not be employed at these levels.

It is apparent that every reliability coefficient should be accompanied by a full description of the type of group on which it was determined. Special attention should be given to the variability and the ability level of the sample. The reported reliability coefficient is applicable only to samples similar to that on which it was computed. A desirable and growing practice in test construction is to fractionate the standardization sample into more homogeneous subgroups, with regard to age, sex, grade level, occupation, and the like, and to report separate reliability coefficients for each subgroup. Under these conditions, the reliability coefficients are more likely to be applicable to the samples with which the test is to be used in actual practice.

INTERPRETATION OF INDIVIDUAL SCORES.


The reliability of a test may be expressed in terms of the standard error of measurement (σmeas), also called the standard error of a score. This measure is particularly well suited to the interpretation of individual scores. For many testing purposes, it is therefore more useful than the reliability coefficient. The standard error of measurement can be easily computed from the reliability coefficient of the test, by the following formula:

σmeas = σt √(1 − r11)

in which σt is the standard deviation of the test scores and r11 the reliability coefficient, both computed on the same group. For example, if deviation IQ's on a particular intelligence test have a standard deviation of 15 and a reliability coefficient of .89, the σmeas of an IQ on this test is: 15√(1 − .89) = 15√.11 = (15)(.33) = 5.

To understand what the σmeas tells us about a score, let us suppose that we had a set of 100 IQ's obtained with the above test by a single boy, Jim. Because of the types of chance errors discussed in this chapter, these scores will vary, falling into a normal distribution around Jim's true score. The mean of this distribution of 100 scores can be taken as the true score and the standard deviation of the distribution can be taken as the standard error of measurement. Like any standard deviation, this standard error can be interpreted in terms of the normal curve frequencies discussed in Chapter 4 (see Figure 3). It will be recalled that between the mean and ±1σ there are approximately 68 percent of the cases in a normal curve. Thus, the chances are roughly 2:1 (or 68:32) that Jim's IQ on any one of these tests will fluctuate between ±1σmeas, or 5 points, on either side of his true IQ. If his true IQ is 110, we would expect him to score between 105 and 115 about two-thirds (68 percent) of the time.

If we want to be more certain of our prediction, we can choose higher odds than 2:1. Reference to Figure 3 in Chapter 4 shows that ±3σ covers 99.7 percent of the cases. It can be ascertained from normal curve frequency tables that a distance of 2.58σ on either side of the mean includes exactly 99 percent of the cases. Hence, the chances are 99:1 that Jim's IQ will fall within 2.58σmeas, or (2.58)(5) = 13 points, on either side of his true IQ. We can thus state at the 99 percent confidence level (with only one chance of error out of 100) that Jim's IQ on any single administration of the test will lie between 97 and 123 (110 − 13 and 110 + 13). If Jim were given 100 equivalent tests, his IQ would fall outside this band of values only once.

In actual practice, of course, we do not have the true scores, but only the scores obtained in a single test administration. Under these circumstances, we could try to follow the above reasoning in the reverse direction. If an individual's obtained score is unlikely to deviate by more than 2.58σmeas from his true score, we could argue that his true score must lie within 2.58σmeas of his obtained score. Although we cannot assign a probability to this statement for any given obtained score, we can say that the statement would be correct for 99 percent of all the cases. On the basis of this reasoning, Gulliksen (1950b, pp. 17-20) proposed that the standard error of measurement be used as illustrated above to estimate the reasonable limits of the true score for persons with any given obtained score. It is in terms of such "reasonable limits" that the error of measurement is customarily interpreted in psychological testing, and it will be so interpreted in this book.

The standard error of measurement and the reliability coefficient are obviously alternative ways of expressing test reliability. Unlike the reliability coefficient, the error of measurement is independent of the variability of the group on which it is computed. Expressed in terms of individual scores, it remains unchanged when found in a homogeneous or a heterogeneous group. On the other hand, being reported in score units, the error of measurement will not be directly comparable from test to test. The usual problems of comparability of units would thus arise when errors of measurement are reported in terms of arithmetic problems, words in a vocabulary test, and the like. Hence, if we want to compare the reliability of different tests, the reliability coefficient is the better measure. To interpret individual scores, the standard error of measurement is more appropriate.

INTERPRETATION OF SCORE DIFFERENCES. It is particularly important to consider test reliability and errors of measurement when evaluating the differences between two scores. Thinking in terms of the range within which each score may fluctuate serves as a check against overemphasizing small differences between scores. Such caution is desirable both when comparing test scores of different persons and when comparing the scores of the same individual in different abilities. Similarly, changes in scores following instruction or other experimental variables need to be interpreted in the light of errors of measurement.

A frequent question about test scores concerns the individual's relative standing in different areas. Is Jane more able along verbal than along numerical lines? Does Tom have more aptitude for mechanical than for verbal activities? If Jane scored higher on the verbal than on the numerical subtests on an aptitude battery and Tom scored higher on the mechanical than on the verbal, how sure can we be that they would still do so on a retest with another form of the battery? In other words, could the score differences have resulted merely from the chance selection of specific items in the particular verbal, numerical, and mechanical tests employed?

Because of the growing interest in the interpretation of score profiles, test publishers have been developing report forms that permit the evaluation of scores in terms of their errors of measurement.
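The computation and the "reasonable limits" reasoning above can be restated in a few lines of Python. This is a minimal sketch of the textbook formula, using the Jim example from the text:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    # sigma_meas = sigma_t * sqrt(1 - r11)
    return sd * math.sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=15, reliability=0.89)
print(f"SEM = {sem:.1f}")  # about 5 IQ points

# Reasonable limits around an obtained IQ of 110:
obtained = 110
for z, level in [(1.00, "68%"), (2.58, "99%")]:
    lo, hi = obtained - z * sem, obtained + z * sem
    print(f"{level} band: {lo:.0f} to {hi:.0f}")  # 105-115 and 97-123
```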

An example is the Individual Report Form for use with the Differential Aptitude Tests, reproduced in Figure 13. On this form, percentile scores on each subtest of the battery are plotted as one-inch bars, with the obtained percentile at the center. Each percentile bar corresponds to a distance of approximately 1½ to 2 standard errors of measurement on either side of the obtained score.8 Hence the assumption that the individual's true score falls within the bar is correct about 90 percent of the time. In interpreting the profiles, test users are advised not to attach importance to differences between scores whose percentile bars overlap, especially if they overlap by more than half their length. In the profile illustrated in Figure 13, the difference between the Verbal Reasoning and Numerical Ability scores probably reflects a genuine difference in ability level; that between Mechanical Reasoning and Space Relations probably does not; the difference between Abstract Reasoning and Mechanical Reasoning is in the doubtful range.

Fig. 13. Score Profile on the Differential Aptitude Tests, Illustrating Use of Percentile Bands. (From the Fifth Edition Manual for the Differential Aptitude Tests, p. 73. Copyright © 1973, 1974 by The Psychological Corporation, New York, N.Y. All rights reserved. Reproduced by permission.)

8 Because the reliability coefficient (and hence the standard error of measurement) varies somewhat with subtest, age, grade, and sex, the actual ranges covered by the one-inch lines are not identical; but they are sufficiently close to permit uniform interpretations for practical purposes.

It is well to bear in mind that the standard error of the difference between two scores is larger than the error of measurement of either of the two scores. This follows from the fact that this difference is affected by the chance errors present in both scores. The standard error of the difference between two scores can be found from the standard errors of measurement of the two scores by the following formula:

σdiff = √(σ²meas1 + σ²meas2)

in which σdiff is the standard error of the difference between the two scores, and σmeas1 and σmeas2 are the standard errors of measurement of the separate scores. By substituting SD√(1 − r11) for σmeas1 and SD√(1 − r22) for σmeas2, we may rewrite the formula directly in terms of the reliability coefficients:

σdiff = SD√(2 − r11 − r22)

In this substitution, the same SD was used for tests 1 and 2, since their scores would have to be expressed in terms of the same scale before they could be compared.

We may illustrate the above procedure with the Verbal and Performance IQ's on the Wechsler Adult Intelligence Scale (WAIS). The split-half reliabilities of these scores are .96 and .93, respectively. WAIS deviation IQ's have a mean of 100 and an SD of 15. Hence the standard error of the difference between these two scores can be found as follows:

σdiff = 15√(2 − .96 − .93) = 4.95

To determine how large a score difference could be obtained by chance at the .05 level, we multiply the standard error of the difference by 1.96. The result is 9.70, or approximately 10 points. Thus the difference between an individual's WAIS Verbal and Performance IQ should be at least 10 points to be significant at the .05 level.
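The two formulas above translate directly into code. The following is a minimal Python sketch of the WAIS example; small differences from the figures in the text reflect rounding:

```python
import math

def sem(sd: float, r: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - r)

def se_difference(sd: float, r1: float, r2: float) -> float:
    """Standard error of the difference between two scores on the same scale;
    algebraically equal to sd * sqrt(2 - r1 - r2)."""
    return math.sqrt(sem(sd, r1) ** 2 + sem(sd, r2) ** 2)

# WAIS Verbal vs. Performance IQ, as in the text: SD = 15, r = .96 and .93.
sd_diff = se_difference(15, 0.96, 0.93)
print(f"SE of difference: {sd_diff:.2f}")  # about 4.97 (the text rounds to 4.95)
print(f"Minimum .05-level difference: {1.96 * sd_diff:.1f}")  # about 10 points
```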

RELIABILITY OF CRITERION-REFERENCED TESTS

It will be recalled from Chapter 4 that criterion-referenced tests usually (but not necessarily) evaluate performance in terms of mastery rather than degree of achievement. A major statistical implication of mastery testing is a reduction in the variability of scores among persons. Theoretically, if everyone continues training until the skill is mastered, variability is reduced to zero. Not only is low variability a result of the way such tests are used; it is also built into the tests through the construction and choice of items, as will be shown in Chapter 8.

In an earlier section of this chapter, we saw that any correlation, including reliability coefficients, is affected by the variability of the group in which it is computed. As the variability of the sample decreases, so does the correlation coefficient. Under these conditions, even a highly stable and internally consistent test could yield a reliability coefficient near zero. Obviously, it would be inappropriate to assess the reliability of most criterion-referenced tests by the usual procedures.

Because of the large number of specific instructional objectives to be tested, criterion-referenced tests typically provide only a small number of items for each objective. In the construction of criterion-referenced tests, two important questions are: (1) How many items must be used for reliable assessment of each of the specific instructional objectives covered by the test? (2) What proportion of items must be correct for the reliable establishment of mastery? In much current testing, these two questions have been answered by judgmental decisions. Efforts are under way, however, to develop appropriate statistical techniques that will provide objective, empirical answers (see, e.g., Ferguson & Novick, 1973; Glaser & Nitko, 1971; Hambleton & Novick, 1973; Livingston, 1972; Millman, 1974; Popham & Husek, 1969).9 A few examples will serve to illustrate the nature and scope of these efforts.

The two questions about number of items and cutoff score can be incorporated into a single hypothesis, amenable to testing within the framework of decision theory and sequential analysis (Glaser & Nitko, 1971; Hambleton & Novick, 1973; Wald, 1947; see also Lindgren & McElrath, 1969). Specifically, we wish to test the hypothesis that the examinee has achieved the required level of mastery in the content domain or instructional objective sampled by the test items. Sequential analysis consists in taking observations one at a time and deciding after each observation whether to: (1) accept the hypothesis, (2) reject the hypothesis, or (3) make additional observations. Thus the number of observations (in this case, the number of items) needed to reach a reliable conclusion is itself determined during the process of testing. Rather than being presented with a fixed, predetermined number of items, the examinee continues taking the test until a mastery or nonmastery decision is reached. At that point, testing is discontinued and the student is either directed to the next instructional level or returned to the nonmastered level for further study. A set of tables for determining the minimum number of items required for establishing mastery at specified levels is provided by Millman (1972, 1973). With the computer facilities described elsewhere in this book, such sequential decision procedures are feasible and can reduce total testing time while yielding reliable estimates of mastery (Glaser & Nitko, 1971). Some investigators have been exploring the use of Bayesian estimation techniques, which lend themselves well to the kind of decisions required by mastery testing. With these techniques, procedures have been developed for incorporating collateral data from the student's previous performance history, as well as from the test results of other students (Ferguson & Novick, 1973; Hambleton & Novick, 1973).

In the development of several criterion-referenced tests, Educational Testing Service has followed an empirical procedure to set standards of mastery. This procedure involves administering the test in classes one grade below and one grade above the grade where the particular concept or skill is taught. The dichotomization can be further refined by using teacher judgments to exclude any cases in the lower grade known to have mastered the concept or skill and any cases in the higher grade who have demonstrably failed to master it. A cutting score, in terms of number or percentage of correct items, is then selected that best discriminates between the two groups.

When flexible, individually tailored procedures are impracticable, more traditional techniques can be utilized to assess the reliability of a given test. For example, mastery decisions reached at a prerequisite instructional level can be checked against performance at the next instructional level. Is there a sizeable proportion of students who reached or exceeded the cutoff score on the mastery test at the lower level and failed to achieve mastery at the next level within a reasonable period of instructional time? Does an analysis of their difficulties suggest that they had not truly mastered the prerequisite skills? If so, these findings would strongly suggest that the mastery test was unreliable. Either the addition of more items or the establishment of a higher cutoff score would seem to be indicated. Another procedure for determining the reliability of a mastery test is to administer two parallel forms to the same individuals and note the percentage of persons for whom the same decision (mastery or nonmastery) is reached on both forms (Hambleton & Novick, 1973).

All statistical procedures for use with criterion-referenced tests are in an exploratory stage. Much remains to be done, in both theoretical development and empirical tryout, before the most effective methodology for different testing situations can be formulated.

9 For fuller discussion of special statistical procedures required for the construction and evaluation of criterion-referenced tests, see Glaser and Nitko (1971), Popham and Husek (1969), Millman (1974), Hambleton and Novick (1973).
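As an illustration of the sequential logic just described, the sketch below implements a simple Wald-style sequential probability ratio test for a mastery decision. It is a schematic example, not a procedure prescribed by the sources cited above: the mastery and nonmastery proportions (p1, p0) and the error rates are hypothetical parameters chosen for demonstration.

```python
import math

def sequential_mastery_test(responses, p0=0.50, p1=0.85, alpha=0.05, beta=0.05):
    """Decide mastery/nonmastery item by item (Wald sequential probability
    ratio test). p0: proportion correct expected of a nonmaster; p1: of a
    master. Returns ("mastery" | "nonmastery" | "undecided", items used)."""
    upper = math.log((1 - beta) / alpha)   # conclude mastery above this bound
    lower = math.log(beta / (1 - alpha))   # conclude nonmastery below this bound
    llr = 0.0                              # running log-likelihood ratio
    for n, correct in enumerate(responses, start=1):
        if correct:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "mastery", n
        if llr <= lower:
            return "nonmastery", n
    return "undecided", len(responses)     # testing would continue in practice

# Item-by-item responses (1 = right, 0 = wrong) for a hypothetical examinee:
print(sequential_mastery_test([1, 1, 0, 1, 1, 1, 1, 1, 1, 1]))  # ('mastery', 9)
```

Note how the number of items administered is an outcome of the procedure rather than a fixed parameter, which is precisely the property that makes sequential testing attractive for mastery decisions.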

CHAPTER 6

Validity: Basic Concepts

THE VALIDITY of a test concerns what the test measures and how well it does so. In this connection, we should guard against accepting the test name as an index of what the test measures. Test names provide short, convenient labels for identification purposes. Most test names are far too broad and vague to furnish meaningful clues to the behavior area covered, although increasing efforts are being made to use more specific and operationally definable test names. The trait measured by a given test can be defined only through an examination of the objective sources of information and empirical operations utilized in establishing its validity (Anastasi, 1950). Moreover, the validity of a test cannot be reported in general terms. No test can be said to have "high" or "low" validity in the abstract. Its validity must be determined with reference to the particular use for which the test is being considered.

Fundamentally, all procedures for determining test validity are concerned with the relationships between performance on the test and other independently observable facts about the behavior characteristics under consideration. The specific methods employed for investigating these relationships are numerous and have been described by various names. In the Standards for Educational and Psychological Tests (1974), these procedures are classified under three principal categories: content, criterion-related, and construct validity. Each of these types of validation procedures will be considered in one of the following sections, and the relations among them will be examined in a concluding section. Techniques for analyzing and interpreting validity data with reference to practical decisions will be discussed in Chapter 7.

CONTENT VALIDITY

NATURE. Content validity involves essentially the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured. Such a validation procedure is commonly used in evaluating achievement tests. This type of test is designed to measure how well the individual has mastered a specific skill or course of study. It might thus appear that mere inspection of the content of the test should suffice to establish its validity for such a purpose. A test of multiplication, spelling, or bookkeeping would seem to be valid by definition if it consists of multiplication, spelling, or bookkeeping items.

The solution, however, is not so simple as it appears to be. One difficulty is that of adequately sampling the item universe. The behavior domain to be tested must be systematically analyzed to make certain that all major aspects are covered by the test items, and in the correct proportions. For example, a test can easily become overloaded with those aspects of the field that lend themselves more readily to the preparation of objective items. The domain under consideration should be fully described in advance, rather than being defined after the test has been prepared. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. Content must therefore be broadly defined to include major objectives, such as the application of principles and the interpretation of data, as well as factual knowledge.

Still another difficulty arises from the possible inclusion of irrelevant factors in the test scores. For example, a test designed to measure proficiency in such areas as mathematics or mechanics may be unduly influenced by the ability to understand verbal directions or by speed of performing simple, routine tasks.

Moreover, content validity depends on the relevance of the individual's test responses to the behavior area under consideration, rather than on the apparent relevance of item content. Mere inspection of the test may fail to reveal the processes actually used by examinees in taking the test. For example, a multiple-choice spelling test may measure the ability to recognize correctly and incorrectly spelled words. But it cannot be assumed that such a test also measures ability to spell correctly from dictation, frequency of misspellings in written compositions, and other aspects of spelling ability (Ahlstrom, 1964; Knoell & Harris, 1952). It is also important to guard against any tendency to overgeneralize regarding the domain sampled by the test. Further discussions of content validity from several angles can be found in Ebel (1956), Huddleston (1956), and Lennon (1956).

SPECIFIC PROCEDURES. Content validity is built into a test from the outset through the choice of appropriate items. For educational tests, the preparation of items is preceded by a thorough and systematic examination of relevant course syllabi and textbooks, as well as by consultation with subject-matter experts.

In listing objectives to be covered in an educational achievement test, the test constructor can be guided by the extensive survey of educational objectives given in the Taxonomy of Educational Objectives (Bloom et al., 1956; Krathwohl et al., 1964). Prepared by a group of specialists in educational measurement, this handbook also provides examples of many types of items designed to test each objective. Two volumes are available, covering the cognitive and affective domains, respectively. The major categories given in the cognitive domain include knowledge (in the sense of remembered facts, terms, methods, principles, etc.), comprehension, application, analysis, synthesis, and evaluation. The classification of affective objectives, concerned with the modification of attitudes, interests, values, and appreciation, includes five major categories: receiving, responding, valuing, organization, and characterization.

On the basis of the information thus gathered, test specifications are drawn up for the item writers. These specifications should show the content areas or topics to be covered, the instructional objectives or processes to be tested, and the relative importance of individual topics and processes. A convenient way to set up such specifications is in terms of a two-way table, with processes across the top and topics in the left-hand column (see Table 14). Not all cells in such a table need to have items, of course, since certain processes may be unsuitable or irrelevant for certain topics. On this basis, the number of items of each kind to be prepared on each topic can be established, with some indication of the number of items in each category. It might be added that such a specification table will also prove helpful in the preparation of teacher-made examinations for classroom use in any subject.

The discussion of content validity in the manual of an achievement test should include information on the content areas and the skills or objectives covered by the test. Information should likewise be provided about the number and nature of course syllabi and texts surveyed, including publication dates. If subject-matter experts participated in the test-construction process, their number and professional qualifications should be stated. If they served as judges in classifying items, the directions they were given should be reported, as well as the extent of agreement among judges. Because curricula and course content change over time, it is particularly desirable to give the dates when subject-matter experts were consulted. In addition, the procedures followed in selecting categories and classifying items should be described.
A number of empirical procedures may also be followed in order to supplement the content validation of an achievement test. Both total scores and performance on individual items can be checked for grade progress. In general, those items are retained that show the largest gains in the percentages of children passing them from the lower to the upper grades. For example, Figure 14 shows a portion of a table from the manual of the Sequential Tests of Educational Progress-Series II (STEP). The 30 items included in Figure 14 represent one part of the Reading test for Level 3, which covers grades 7 to 9. For every item in each test in this achievement battery, the information provided includes its classification with regard to learning skill and type of material, as well as the percentage of children in the normative sample who gave the right answer to the item in each of the grades for which that level of the test is designed.

Fig. 14. (Portion of a table from the STEP manual: each Reading item classified by learning skill and by type of material, Narrative, Social Studies, Science, or Humanities, with the percentage answering correctly in each grade.)

Other supplementary procedures may be employed, when appropriate. To detect the possible irrelevant influence of ability to read instructions on test performance, scores on the test can be correlated with scores on a reading comprehension test. Or, if the test is designed to measure reading comprehension, giving the questions without the reading passage on which they are based will show how many could be answered simply from the examinees' prior information or other irrelevant cues. The contribution of speed can be checked by noting how many persons fail to finish the test or by one of the more refined methods discussed in Chapter 5. Other procedures include analyses of the types of errors commonly made on a test and observation of the work methods employed by examinees. The latter could be done by testing students individually with instructions to "think aloud" while solving each problem.
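Several of these checks reduce to simple computations. The sketch below is a hypothetical Python illustration with invented data, computing the item-level grade-progress gain and the correlation with a reading test that the text describes:

```python
import statistics

# Percentage passing a given item in successive grades (hypothetical data).
pct_passing = {"grade 7": 42, "grade 8": 55, "grade 9": 68}
grade_gain = pct_passing["grade 9"] - pct_passing["grade 7"]
print(f"Grade-progress gain: {grade_gain} percentage points")

# Correlation of test scores with reading comprehension scores: a high value
# would suggest that reading ability is contaminating test performance.
test_scores    = [55, 62, 47, 70, 58, 66, 51, 73]
reading_scores = [50, 60, 45, 72, 55, 65, 48, 70]
r = statistics.correlation(test_scores, reading_scores)  # Python 3.10+
print(f"Correlation with reading test: {r:.2f}")
```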
APPLICATIONS. Content validity is particularly appropriate for the criterion-referenced tests described in Chapter 4. Because performance on these tests is interpreted in terms of content meaning, it is obvious that content validity is a prime requirement for their effective use. Fundamentally, content validity provides an adequate technique for evaluating achievement tests. It permits us to answer two questions that are basic to the validity of an achievement test: (1) Does the test cover a representative sample of the specified skills and knowledge? (2) Is test performance reasonably free from the influence of irrelevant variables?

Content validation is also applicable to certain occupational tests designed for employee selection and classification, as will be discussed in Chapter 15. This type of validation is suitable when the test is an actual job sample or otherwise calls for the same skills and knowledge required on the job. In such cases, a thorough job analysis should be carried out in order to demonstrate the close resemblance between the job activities and the test.

For aptitude and personality tests, content validity is usually inappropriate and may, in fact, be misleading. Although considerations of relevance and effectiveness of content must obviously enter into the initial stages of constructing any test, eventual validation of aptitude or personality tests requires empirical verification by the procedures to be described in the following sections. These tests bear less intrinsic resemblance to the behavior domain they are trying to sample than do achievement tests. Consequently, the content of aptitude and personality tests can do little more than reveal the hypotheses that led the test constructor to choose a certain type of content for measuring a specified trait. Such hypotheses need to be empirically confirmed to establish the validity of the test.

Unlike achievement tests, aptitude and personality tests are not based on a specified course of instruction or uniform set of prior experiences from which test content can be drawn. Hence, in the latter tests, individuals are likely to vary more in the work methods or psychological processes employed in responding to the same test items. The identical test might thus measure different functions in different persons. Under these conditions, it would be virtually impossible to determine the psychological functions measured by the test from an inspection of its content. For example, college graduates might solve a problem in verbal or mathematical terms, while a mechanic would arrive at the same solution in terms of spatial visualization. Or a test measuring arithmetic reasoning among high school freshmen might measure only individual differences in speed of computation when given to college students. A specific illustration of the dangers of relying on content analysis of aptitude tests is provided by a study conducted with a digit-symbol substitution test (Burik, 1950). This test, generally regarded as a typical "code-learning" test, was found to measure chiefly motor speed in a group of high school students.

FACE VALIDITY. Content validity should not be confused with face validity. The latter is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. Fundamentally, the question of face validity concerns rapport and public relations. Although common usage of the term validity in this connection may make for confusion, face validity itself is a desirable feature of tests. For example, when tests originally designed for children and developed within a classroom setting were first extended for adult use, they frequently met with resistance and criticism because of their lack of face validity. Certainly if test content appears irrelevant, silly, inappropriate, or childish, the result will be poor cooperation, regardless of the actual validity of the test. Especially in adult testing, it is not sufficient for a test to be objectively valid. It also needs face validity to function effectively in practical situations.

Face validity can often be improved by merely reformulating test items in terms that appear relevant and plausible in the particular setting in which they will be used. For example, if a test of simple arithmetic reasoning is constructed for use with machinists, the items should be worded in terms of machine operations rather than in terms of "how many oranges can be purchased for 36 cents" or other traditional schoolbook problems. Similarly, an arithmetic test for naval personnel can be expressed in naval terminology, without necessarily altering the functions measured. To be sure, face validity should never be regarded as a substitute for objectively determined validity. It cannot be assumed that improving the face validity of a test will improve its objective validity. Nor can it be assumed that when a test is modified so as to increase its face validity, its objective validity remains unaltered. The validity of the test in its final form will always need to be directly checked.

CRITERION-RELATED VALIDITY

CONCURRENT AND PREDICTIVE VALIDITY. Criterion-related validity indicates the effectiveness of a test in predicting an individual's behavior in specified situations. For this purpose, performance on the test is checked against a criterion, i.e., a direct and independent measure of that which the test is designed to predict. Thus, for a mechanical aptitude test, the criterion might be subsequent job performance as a machinist; for a scholastic aptitude test, it might be college grades; and for a neuroticism test, it might be associates' ratings or other available information on the subjects' behavior in various life situations.

The criterion measure against which test scores are validated may be obtained at approximately the same time as the test scores or after a stated interval. The APA test Standards (1974) differentiate between concurrent and predictive validity on the basis of these time relations between criterion and test. The term "prediction" can be used in the broader sense, to refer to prediction from the test to any criterion situation, or in the more limited sense of prediction over a time interval. It is in the latter sense that it is used in the expression "predictive validity." The information provided by predictive validity is most relevant to tests used in the selection and classification of personnel. Hiring job applicants, selecting students for admission to college or professional schools, and assigning military personnel to occupational training programs represent examples of the sort of decisions requiring a knowledge of the predictive validity of tests. Other examples include the use of tests to screen out applicants likely to develop emotional disorders in stressful environments and the use of tests to identify psychiatric patients most likely to benefit from a particular therapy.

In a number of instances, concurrent validity is found merely as a substitute for predictive validity. It is frequently impracticable to extend validation procedures over the time required for predictive validity or to obtain a suitable preselection sample for testing purposes. As a compromise solution, therefore, tests are administered to a group on whom criterion data are already available. Thus, the test scores of college students may be compared with their cumulative grade-point average at the time of testing, or those of employees compared with their current job success.

For certain uses of psychological tests, on the other hand, concurrent validity is the most appropriate type and can be justified in its own right. The logical distinction between predictive and concurrent validity is based, not on time, but on the objectives of testing. Concurrent validity is relevant to tests employed for diagnosis of existing status, rather than prediction of future outcomes. The difference can be illustrated by asking: "Is Smith neurotic?" (concurrent validity) and "Is Smith likely to become neurotic?" (predictive validity). Because the criterion for concurrent validity is always available at the time of testing, we might ask what function is served by the test in such situations. Basically, such tests provide a simpler, quicker, or less expensive substitute for the criterion data. For example, if the criterion consists of continuous observation of a patient during a two-week hospitalization period, a test that could sort out normal from neurotic and doubtful cases would appreciably reduce the number of persons requiring such extensive observation.

CRITERION CONTAMINATION. An essential precaution in finding the validity of a test is to make certain that the test scores do not themselves influence any individual's criterion status. For example, if a college instructor or a foreman in an industrial plant knows that a particular individual scored very poorly on an aptitude test, such knowledge might influence the grade given to the student or the rating assigned to the worker. Or a high-scoring person might be given the benefit of the doubt when academic grades or on-the-job ratings are being prepared. Such influences would obviously raise the correlation between test scores and criterion in a manner that is entirely spurious or artificial. This possible source of error in test validation is known as criterion contamination, since the criterion ratings become "contaminated" by the knowledge of the test scores. To prevent the operation of such an error, it is absolutely essential that no person who participates in the assignment of criterion ratings have any knowledge of the examinees' test scores. For this reason, test scores employed in "testing the test" must be kept strictly confidential. It is sometimes difficult to convince teachers, military officers, and other line personnel that such a precaution is essential. In their urgency to utilize all available information for decisions, such persons may fail to realize that the test scores must be put aside until the criterion data mature and validity can be checked.

COMMON CRITERIA. Any test may be validated against as many criteria as there are specific uses for it. Any method for assessing behavior in any situation could provide a criterion measure for some particular purpose. The criteria employed in finding the validities reported in test manuals, however, fall into a few common categories. Among the criteria most frequently employed in validating intelligence tests is some index of academic achievement. It is for this reason that such tests have often been more precisely described as measures of scholastic aptitude. The specific indices used as criterion measures include school grades, achievement test scores, promotion and graduation records, special honors and awards, and teachers' or instructors' ratings for "intelligence." Insofar as ratings given within an academic setting are likely to be heavily colored by the individual's scholastic performance, they may be properly classed with the criterion of academic achievement.

The various indices of academic achievement have provided criterion data at all educational levels, from the primary grades to college and graduate school. Although employed principally in the validation of general intelligence tests, they have also served as criteria for certain multiple-aptitude and personality tests. In the validation of any of these types of tests for use in the selection of college students, for example, a common criterion is freshman grade-point average. This measure is the average grade in all courses taken during the freshman year, each grade being weighted by the number of course points for which it was received.

A variant of the criterion of academic achievement frequently employed with out-of-school adults is the amount of education the individual completed. This criterion has been used to some extent in the validation of general intelligence as well as personality tests. It is expected that in general the more intelligent individuals continue their education longer, while the less intelligent drop out of school earlier. The assumption underlying this criterion is that the educational ladder serves as a progressively selective influence, eliminating those incapable of continuing beyond each step. Although it is undoubtedly true that college graduates, for example, represent a more highly selected group than elementary school graduates, the relation between amount of education and scholastic aptitude is far from perfect. Especially at the higher educational levels, economic, social, motivational, and other nonintellectual factors may influence the continuation of the individual's education. Moreover, with such concurrent validation it is difficult to disentangle cause-and-effect relations. To what extent are the obtained differences in intelligence test scores simply the result of the varying amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured, as in predictive validation.

In the development of special aptitude tests, a frequent type of criterion is based on performance in specialized training. For example, mechanical aptitude tests may be validated against final achievement in shop courses. Various business school courses, such as stenography, typing, or bookkeeping, provide criteria for aptitude tests in these areas. Similarly, performance in music or art schools has been employed in validating music and art aptitude tests. Several professional aptitude tests have been validated in terms of achievement in schools of law, medicine, dentistry, engineering, and other areas. In the case of custom-made tests designed for use within a specific testing program, training records are a frequent source of criterion data. An outstanding illustration is the validation of Air Force pilot selection tests against performance in basic flight training. Performance in training programs is also commonly used as a criterion for test validation in other military occupational specialties and in some industrial validation studies. Among the specific indices of training performance employed for criterion purposes may be mentioned achievement tests administered on completion of training, formally assigned grades, instructors' ratings, and successful completion of training versus elimination from the program. Multiple aptitude batteries have often been checked against grades in specific high school or college courses, in order to determine their validity as differential predictors. For example, scores on a verbal comprehension test may be compared with grades in English courses, spatial visualization scores with geometry grades, and so forth.

In connection with the use of training records in general as criterion measures, a useful distinction is that between intermediate and ultimate criteria. In the development of an Air Force pilot-selection test or a medical aptitude test, for example, the ultimate criteria would be combat performance and eventual achievement as a practicing physician, respectively. Obviously it would require a long time for such criterion data to mature. It is doubtful, moreover, whether a truly ultimate criterion is ever obtained in actual practice. Even were such an ultimate criterion available, it would probably be subject to many uncontrolled factors that would render it relatively useless. For example, it would be difficult to evaluate the relative degree of success of physicians practicing different specialties and in different parts of the country. For these reasons, such intermediate criteria as performance records at some stage of training are frequently employed as criterion measures.

For many purposes, the most satisfactory type of criterion measure is that based on follow-up records of actual job performance. This criterion has been used to some extent in the validation of general intelligence as well as personality tests, and to a larger extent in the validation of special aptitude tests. It is a common criterion in the validation of custom-made tests for specific jobs. The "jobs" in question may vary widely in both level and kind, including work in business, industry, the professions, and the armed services. Most measures of job performance, although probably not representing ultimate criteria, at least provide good intermediate criteria for many testing purposes. In this respect they are to be preferred to training records. On the other hand, the measurement of job performance does not permit as much uniformity of conditions as is possible during training. Moreover, since it usually involves a longer follow-up, the criterion of job performance is likely to entail a loss in the number of available subjects. Because of the variation in the nature of nominally similar jobs in different organizations, test manuals reporting validity data against job criteria should describe not only the specific criterion measures employed but also the job duties performed by the workers.

The method of contrasted groups is used quite commonly in the validation of personality tests. Such contrasted groups can be selected on the basis of any criterion, such as school grades, ratings, or job performance, by simply choosing the extremes of the distribution of criterion measures. Validation by the method of contrasted groups generally involves a composite criterion that reflects the cumulative and uncontrolled selective influences of everyday life. The contrasted groups included in the present category are distinct groups that have gradually become differentiated through the operation of the multiple demands of daily living. In the validation of an intelligence test, for example, the scores obtained by institutionalized mentally retarded children may be compared with those obtained by schoolchildren of the same age, or with the scores of unselected high school or college students. In this case, the multiplicity of factors determining commitment to an institution for the mentally retarded constitutes the criterion. Similarly, the validity of a musical aptitude or a mechanical aptitude test may be checked by comparing the scores obtained by students enrolled in a music school or an engineering school with the scores of unselected students.

Occupational groups have frequently been used in the development and validation of interest tests, such as the Strong Vocational Interest Blank, as well as in the preparation of attitude scales. The assumption underlying such a procedure is that, with reference to many social traits, individuals who have entered and remained in such occupations as selling or executive work will as a group excel persons in such fields as clerical work or engineering. Thus, the test performance of salesmen or executives, for example, may be compared with that of clerks or engineers. Other groups sometimes employed in the validation of attitude scales include political, religious, geographical, or other special groups generally known to represent distinctly different points of view on certain issues. Similarly, college students who have engaged in many extracurricular activities may be compared, in the validation of a test of social traits, with those who have participated in none during a comparable period of college attendance.

In the development of certain personality tests, psychiatric diagnosis is used both as a basis for the selection of items and as evidence of test validity. Psychiatric diagnosis may serve as a satisfactory criterion provided that it is based on prolonged observation and detailed case history, as when the criterion consists of continuous observation of a patient during a two-week hospitalization period, rather than on a cursory psychiatric interview or examination. In the latter case, there is no reason to expect the psychiatric diagnosis to be superior to the test score itself as an indication of the individual's emotional condition. Such a psychiatric diagnosis could not be regarded as a criterion measure, but rather as an indicator or predictor whose own validity would have to be determined.

Ratings have been employed in the validation of almost every type of test. Mention has already been made, in connection with other criterion categories, of certain types of ratings by school teachers, instructors in specialized courses, and job supervisors. To these can be added ratings by officers in military situations, ratings of students by school counselors, and ratings by co-workers, classmates, fellow club members, and other groups of associates. The ratings discussed earlier represented merely a subsidiary technique for obtaining information regarding such criteria as academic achievement, performance in specialized training, or job success. We are now considering the use of ratings as the very core of the criterion measure. Under these circumstances, the ratings themselves define the criterion. Moreover, such ratings are not restricted to the evaluation of specific achievement, but involve a personal judgment by an observer regarding any of the variety of traits that psychological tests attempt to measure. Thus, in validating a test of social traits, the subjects in the validation sample might be rated on such characteristics as dominance, mechanical ingenuity, originality, leadership, or honesty. Ratings are particularly useful in providing criteria for personality tests.
In their urgency to utilize all available information for decisions, such persons may fail to realize that the test scores should be put aside until the criterion data mature and validity can be checked. It is sometimes difficult to convince teachers, military officers, and other line personnel that such a precaution is essential. To prevent such contamination, it is absolutely essential that no person who participates in the assignment of criterion ratings have any knowledge of the examinees' test scores, since the criterion ratings would otherwise become "contaminated" by the rater's knowledge of the test scores. For this reason, test scores employed in "testing the test" must be kept strictly confidential.

COMMON CRITERIA. Any test may be validated against as many criteria as there are specific uses for it. Any method for assessing behavior in any situation could provide a criterion measure for some particular purpose. The criteria employed in finding the validities reported in test manuals, however, fall into a few common categories. Among the criteria most frequently employed in validating intelligence tests is some index of academic achievement; it is for this reason that such tests have often been more precisely described as measures of scholastic aptitude. The specific indices used as criterion measures include school grades, achievement test scores, promotion and graduation records, special honors and awards, and teachers' or instructors' ratings for "intelligence." Insofar as ratings given within an academic setting are likely to be heavily colored by the individual's scholastic performance, they may properly be grouped with the criterion of academic achievement.

The various indices of academic achievement have provided criterion data at all educational levels, from the primary grades to college and graduate school. Although employed principally in the validation of general intelligence tests, they have also served as criteria for certain multiple-aptitude and personality tests. In the validation of any of these types of tests for use in the selection of college students, for example, a common criterion is freshman grade-point average. This measure is the average grade in all courses taken during the freshman year, each grade being weighted by the number of course points for which it was received.

A variant of the criterion of academic achievement frequently employed with out-of-school adults is the amount of education the individual completed. It is expected that in general the more intelligent individuals continue their education longer, while the less intelligent drop out of school earlier. The assumption underlying this criterion is that the educational ladder serves as a progressively selective influence, eliminating those incapable of continuing beyond each step. Although it is undoubtedly true that college graduates, for example, represent a more highly selected group than elementary school graduates, the relation between amount of education and scholastic aptitude is far from perfect. Social, economic, motivational, and other nonintellectual factors may influence the continuation of the individual's education. Moreover, with such concurrent validation it is difficult to disentangle cause-and-effect relations. To what extent are the obtained differences in intelligence test scores simply the result of the varying amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured, as in predictive validation.

In the development of special aptitude tests, a frequent type of criterion is based on performance in specialized training. For example, mechanical aptitude tests may be validated against final achievement in shop courses. Various business school courses, such as stenography or bookkeeping, provide criteria for aptitude tests in these areas. Similarly, performance in music or art schools has been employed in validating music or art aptitude tests. Several professional aptitude tests have been validated in terms of achievement in schools of law, medicine, dentistry, engineering, and other areas. An outstanding illustration is the validation of Air Force pilot selection tests against performance in basic flight training. Performance in training programs is also commonly used as a criterion for test validation in other military occupational specialties and in some industrial validation studies. Among the specific indices of training performance employed for criterion purposes may be mentioned achievement tests administered on completion of training, formally assigned grades, instructors' ratings, and successful completion of training versus elimination from the program. Multiple aptitude batteries have often been checked against grades in specific high school or college courses, in order to determine their validity as differential predictors. For example, scores on a verbal comprehension test may be compared with grades in English courses, spatial visualization scores with geometry grades, and so forth.

In connection with the use of training records in general as criterion measures, a useful distinction is that between intermediate and ultimate criteria. In the development of an Air Force pilot-selection test or a medical aptitude test, for example, the ultimate criteria would be combat performance and eventual achievement as a practicing physician, respectively. Obviously it would require a long time for such criterion data to mature. It is doubtful, moreover, whether a truly ultimate criterion is ever obtained in actual practice. Even were such an ultimate criterion available, it would probably be subject to many uncontrolled factors that would render it relatively useless. For example, it would be difficult to evaluate the relative degree of success of physicians practicing different specialties and in different parts of the country. For these reasons, such intermediate criteria as performance records at some stage of training are frequently employed as criterion measures.

For many purposes, the most satisfactory type of criterion measure is that based on follow-up records of actual job performance. This criterion has been used to some extent in the validation of general intelligence as well as personality tests, and to a larger extent in the validation of special aptitude tests. It is a common criterion in the validation of custom-made tests for specific jobs. The "jobs" in question may vary widely in both level and kind, including work in business, industry, the professions, and the armed services. Most measures of job performance, although probably not representing ultimate criteria, at least provide good intermediate criteria for many testing purposes. In this respect they are to be preferred to training records. On the other hand, the measurement of job performance does not permit as much uniformity of conditions as is possible during training. Moreover, since it usually involves a longer follow-up, the criterion of job performance is likely to entail a loss in the number of available subjects. Because of the variation in the nature of nominally similar jobs in different organizations, test manuals reporting validity data against job criteria should describe not only the specific criterion measures employed but also the job duties performed by the workers.

Validation by the method of contrasted groups generally involves a composite criterion that reflects the cumulative and uncontrolled selective influences of everyday life. This criterion is ultimately based on survival within a particular group versus elimination therefrom. For example, in the validation of an intelligence test, the scores obtained by institutionalized mentally retarded children may be compared with those obtained by schoolchildren of the same age. In this case, the multiplicity of factors determining commitment to an institution for the mentally retarded constitutes the criterion. Similarly, the validity of a musical aptitude or a mechanical aptitude test may be checked by comparing the scores obtained by students enrolled in a music school or an engineering school, respectively, with the scores of unselected high school or college students.

The method of contrasted groups is used quite commonly in the validation of personality tests. Thus, in validating a test of social traits, the test performance of salesmen or executives may be compared with that of clerks or engineers. The assumption underlying such a procedure is that, with reference to many social traits, individuals who have entered and remained in such occupations as selling or executive work will as a group excel persons in such fields as clerical work or engineering. Similarly, college students who have engaged in many extracurricular activities may be compared with those who have participated in none during a comparable period of college attendance. Occupational groups have frequently been used in the development and validation of interest tests, such as the Strong Vocational Interest Blank, as well as in the preparation of attitude scales. Other groups sometimes employed in the validation of attitude scales include political, religious, geographical, or other special groups generally known to represent distinctly different points of view on certain issues. To be sure, contrasted groups can be selected on the basis of any criterion, such as school grades, ratings, or job performance, by simply choosing the extremes of the distribution of criterion measures. The contrasted groups included in the present category, however, are distinct groups that have gradually become differentiated through the operation of the multiple demands of daily living. The criterion under consideration is thus more complex and less clearly definable than those previously discussed.

In the development of certain personality tests, psychiatric diagnosis is used both as a basis for the selection of items and as evidence of test validity. Psychiatric diagnosis may serve as a satisfactory criterion provided that it is based on prolonged observation and detailed case history, rather than on a cursory psychiatric interview or examination. In the latter case, there is no reason to expect the psychiatric diagnosis to be superior to the test score itself as an indication of the individual's emotional condition. Such a psychiatric diagnosis could not be regarded as a criterion measure, but rather as an indicator or predictor whose own validity would have to be determined.

Mention has already been made, in connection with other criterion categories, of certain types of ratings by school teachers, instructors in specialized courses, and job supervisors. To these can be added ratings by officers in military situations, ratings of students by school counselors, and ratings by co-workers, classmates, fellow club-members, and other groups of associates. The ratings discussed earlier represented merely a subsidiary technique for obtaining information regarding such criteria as academic achievement, performance in specialized training, or job success. We are now considering the use of ratings as the very core of the criterion measure. Under these circumstances, the ratings themselves define the criterion. Moreover, such ratings are not restricted to the evaluation of specific achievement, but involve a personal judgment by an observer regarding any of the variety of traits that psychological tests attempt to measure. Thus, the subjects in the validation sample might be rated on such characteristics as dominance, mechanical ingenuity, originality, leadership, or honesty.

Ratings have been employed in the validation of almost every type of test. They are particularly useful in providing criteria for personality tests, since objective criteria are much more difficult to find in this area.
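The logic of the contrasted-groups comparison can be reduced to a simple computation. The following minimal sketch uses entirely hypothetical score distributions for the musical-aptitude example above and expresses the group contrast as a point-biserial correlation between group membership and test score; the data and variable names are illustrative assumptions, not figures from any actual validation study.

    import numpy as np

    # Hypothetical total scores on a musical aptitude test for two
    # naturally contrasted groups: music-school students and a
    # sample of unselected college students.
    music_school = np.array([78, 85, 92, 88, 74, 81, 90, 86, 79, 83])
    unselected   = np.array([55, 62, 70, 58, 66, 73, 60, 52, 68, 64])

    # Code group membership 1/0 and pool the scores.
    scores = np.concatenate([music_school, unselected])
    group  = np.concatenate([np.ones(len(music_school)),
                             np.zeros(len(unselected))])

    # The point-biserial correlation between group membership and
    # score is simply the Pearson correlation with a dichotomous
    # variable.
    r_pb = np.corrcoef(group, scores)[0, 1]

    print(f"Mean, music school : {music_school.mean():.1f}")
    print(f"Mean, unselected   : {unselected.mean():.1f}")
    print(f"Point-biserial r   : {r_pb:.2f}")

A substantial positive coefficient indicates that the test separates the two naturally selected groups in the expected direction, which is the essential claim of this validation method.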

This is especially true of distinctly social traits, in which ratings based on personal contact may constitute the most logically defensible criterion. Although ratings may be subject to many judgmental errors, when obtained under carefully controlled conditions they represent a valuable source of criterion data. Techniques for improving the accuracy of ratings and for reducing common types of errors will be considered in Chapter 20.

Finally, correlations between a new test and previously available tests are frequently cited as evidence of validity. When the new test is an abbreviated or simplified form of a currently available test, the latter can properly be regarded as a criterion measure. Thus, a paper-and-pencil test might be validated against a more elaborate and time-consuming performance test whose validity had previously been established. Or a group test might be validated against an individual test. The Stanford-Binet, for example, has repeatedly served as a criterion in validating group tests. In such a case, the new test may be regarded at best as a crude approximation of the earlier one. It should be noted that unless the new test represents a simpler or shorter substitute for the earlier test, the use of the latter as a criterion is indefensible.

SPECIFICITY OF CRITERIA. Criterion-related validity is most appropriate for local validation studies, in which the effectiveness of a test for a specific program is to be assessed. This is the approach followed, for example, when a given company wishes to evaluate a test for selecting applicants for one of its jobs, or when a given college wishes to determine how well an academic aptitude test can predict the course performance of its students. Criterion-related validity can be best characterized as the practical validity of a test in a specified situation. This type of validation represents applied research, as distinguished from basic research, and as such it provides results that are less generalizable than the results of other procedures.

That criterion-related validity may be quite specific has been demonstrated repeatedly. Figure 15 gives examples of the wide variation in the correlations of a single type of test with criteria of job proficiency. The first graph shows the distribution of 72 correlations found between intelligence test scores and measures of the job proficiency of general clerks; the second graph summarizes in similar fashion 191 correlations between finger dexterity tests and the job proficiency of benchworkers. Although in both instances the correlations tend to cluster in a particular range of validity, the variation among individual studies is considerable. The validity coefficient may be high and positive in one study and negligible or even substantially negative in another.

FIG. 15. Examples of Variation in Validity Coefficients of Given Tests for Particular Jobs: 72 coefficients for general clerks on intelligence tests against proficiency criteria, and 191 coefficients for bench workers on finger dexterity tests against proficiency criteria. (Adapted from Ghiselli, 1966.)

Some of the variation in validity coefficients reported in Figure 15 results from differences among the specific tests employed in different studies to measure the aptitude in question. Differences in the criteria themselves are undoubtedly a major reason for the variation; the duties of office clerks or of benchworkers, for example, may differ widely from one situation to another. Moreover, some variation is attributable to differences in the homogeneity and level of the groups tested.

Similar variation with regard to the prediction of course grades is illustrated in Figure 16. This figure shows the distribution of correlations obtained between grades in mathematics and scores on each of the subtests of the Differential Aptitude Tests. For the Numerical Ability test (NA), for example, the largest number of validity coefficients among boys fell between .50 and .59, but the correlations obtained in different mathematics courses and in different schools ranged from .22 to .75. Equally wide differences were found with the other subtests, and with grades in other subjects not included in Figure 16.

FIG. 16. Graphic Summary of Validity Coefficients of the Differential Aptitude Tests (Forms S and T) for Course Grades in Mathematics. The numbers in each column indicate the number of coefficients in the interval given at the left. (From the Fifth Edition Manual, p. 82. Copyright © 1975 by The Psychological Corporation, New York, N.Y. All rights reserved. Reproduced by permission.)

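The kind of scatter summarized in Figures 15 and 16 can be mimicked with a small simulation. The sketch below assumes entirely hypothetical population validities and small local samples; it merely shows how the validity coefficient of one test, recomputed in several local studies, spreads widely around any single "true" value. None of the numbers corresponds to an actual study.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical local validation studies: the same aptitude test
    # is correlated with a job-proficiency criterion in several
    # companies whose true test-criterion relationships differ.
    true_validities = [0.45, 0.10, 0.60, 0.30, -0.05]
    coefficients = []

    for rho in true_validities:
        n = 40                               # small local sample
        test = rng.normal(size=n)
        # Criterion constructed so that its population correlation
        # with the test is approximately rho.
        criterion = rho * test + np.sqrt(1 - rho**2) * rng.normal(size=n)
        coefficients.append(np.corrcoef(test, criterion)[0, 1])

    print("Local validity coefficients:",
          [f"{r:.2f}" for r in coefficients])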
The range of validity coefficients found in Figures 15 and 16, however, is far wider than could be explained in these terms. Criteria may also differ in content and may change over time. The validity coefficient of a test against job criteria may change for such reasons as the changing nature of jobs, shifts in organizational conditions, and other temporal changes; there is also evidence that the traits required for successful performance of a given job may change over time (Ghiselli, 1966; MacKinney, 1967). It is well known, of course, that educational curricula and course content change over time; and courses in the same subject may differ in content, teaching method, instructor characteristics, bases for evaluating student achievement, and numerous other ways. Similarly, what appears to be the same job may differ widely among companies, or among departments in the same company. It follows that criterion-related validity is itself subject to temporal changes. In other words, the criteria most commonly used in validating intelligence and aptitude tests, namely educational achievement and job performance, are dynamic rather than static.

Criteria not only differ across situations and over time, but they are also likely to be complex. Success on a job, in school, or in other activities of daily life depends not on one trait but on many traits; hence practical criteria are likely to be multifaceted. Several different indicators or measures of job proficiency or academic achievement could thus be used in validating a test. Since these measures may tap different traits or combinations of traits, it is not surprising to find that they yield different validity coefficients for any given test. When different criterion measures are obtained for the same individuals, moreover, their intercorrelations are often quite low (Seashore, Indik, & Georgopoulos, 1960). For example, accident records or absenteeism may show virtually no relation to productivity data for the same job (Seashore, 1959). Similarly, a test may fail to correlate significantly with supervisors' ratings of job proficiency and yet show appreciable validity in predicting who will resign and who will be promoted at a later date (Albright, Smith, & Glennon, 1961; Wallace, 1965). These differences, of course, are reflected in the validity coefficients of any given test against different criterion measures.

Because of criterion complexity, validating a test against a composite criterion of job proficiency, academic achievement, or other similar accomplishments may be of questionable value and is certainly of limited generality. If different subcriteria are relatively independent, a more effective procedure is to validate each test against that aspect of the criterion it is best designed to measure. An analysis of these more specific relationships lends meaning to the test scores in terms of the multiple dimensions of criterion behavior (Dunnette, 1963). For instance, one test might prove to be a valid predictor of a clerk's perceptual speed and accuracy in handling detail work, another of his ability to spell correctly, and still another of his ability to resist distraction.

SYNTHETIC VALIDITY. We return now to the practical question of evaluating a test or combination of tests for effectiveness in predicting a complex criterion such as success on a given job. If validity is as specific as the preceding discussion suggests, we are faced with the necessity of conducting a separate validation study in each local situation and repeating it at frequent intervals. This is admittedly a desirable procedure and one that is often recommended in test manuals. In many situations, however, it is not feasible to follow this procedure because of well-nigh insurmountable practical obstacles. Even if adequately trained personnel are available to carry out the necessary research, most criterion-related validity studies conducted in industry are likely to prove unsatisfactory for at least three reasons. First, the number of employees engaged in the same or closely similar jobs within a company is often too small for significant statistical results. Second, correlations will very probably be lowered by restriction of range through preselection, since only those persons actually hired can be followed up on the job. Third, it is difficult to obtain dependable and sufficiently comprehensive criterion data.

For all the reasons discussed above, personnel psychologists have shown increasing interest in a technique known as synthetic validity. First introduced by Lawshe (1952), the concept of synthetic validity has been defined by Balma (1959, p. 395) as "the inferring of validity in a specific situation from a systematic analysis of job elements, a determination of test validity for these elements, and a combination of elemental validities into a whole." Essentially, the process involves three steps: (1) detailed job analysis to identify the job elements and their relative weights; (2) analysis and empirical study of each test to determine the extent to which it measures proficiency in performing each of these job elements; and (3) finding the validity of each test for the given job synthetically from the weights of these elements in the job and in the test. Several procedures have been developed for gathering the needed empirical data and for combining these data to obtain an estimate of synthetic validity for a particular complex criterion (see, e.g., Guion, 1965; Lawshe & Balma, 1959; McCormick, 1959; Primoff, 1975).

In a long-term research program conducted with U.S. Civil Service job applicants, Primoff (1975) has developed the J-coefficient (for "job coefficient") as an index of synthetic validity. Among the special features of this procedure are the listing of job elements expressed in terms of worker behavior and the rating of the relative importance of these elements in each job by supervisors and job incumbents. Correlations between test scores and self-ratings on job elements are found in total applicant samples, not subject to the preselection of employed workers; and these data are obtained from different samples of applicant populations. Various checking procedures are followed to ensure stability of the correlations and of the weights derived from self-ratings, as well as adequacy of criterion coverage. The final estimate of the correlation between test and job performance is found from the correlation of each job element with the particular job and the weight of the same element in the given test. The statistical procedures are essentially an adaptation of multiple regression equations: for each job element, its correlation with the job is multiplied by its weight in the test, and these products are added across all appropriate job elements. There is evidence that the J-coefficient has proved helpful in improving the employment opportunities of minority applicants and persons with little formal education, because of its concentration on job-relevant skills (Primoff, 1975).

A different application of synthetic validity, especially suitable for use in a small company with few employees in each type of job, is described by Guion (1965). The study was carried out in a company having 48 employees, each of whom was doing a job that was appreciably different from the jobs of the other employees. Detailed job analyses nevertheless revealed seven job elements common to many jobs. Each employee was rated on the job elements appropriate to his job, and these ratings were then checked against the employees' scores on each test in a trial battery. On the basis of these analyses, a separate battery could be "synthesized" for each job by combining the two best tests for each of the job elements demanded by that job. When the batteries thus assembled were applied to a subsequently hired group of 13 employees, the results showed considerable promise. Because of the small number of cases, these results are only suggestive; the study was conducted primarily to demonstrate a model for the utilization of synthetic validity.

In summary, the concept of synthetic validity can be implemented in different ways to fit the practical exigencies of different situations. It offers a promising approach to the problem of complex and changing criteria, and it permits the assembling of test batteries to fit the requirements of specific jobs and the determination of test validity in many contexts where adequate criterion-related validation studies are impracticable. The two examples of synthetic validity were cited only to illustrate the scope of possible applications of these techniques. For a description of the actual procedures followed, the reader is referred to the original sources.
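The additive logic underlying indices of the J-coefficient type can be written out directly. The sketch below is a minimal illustration of a synthetic-validity estimate under the rule described above: the job elements, their correlations with the job, and the test's weights on the elements are all hypothetical values chosen for the example, not figures from Primoff's research.

    # Hypothetical job elements for a clerical job, each with (a) the
    # correlation of the element with overall performance on this job
    # and (b) the weight of the element in the test under study.
    job_elements = {
        # element:         (corr. with job, weight in test)
        "checking detail": (0.60, 0.50),
        "arithmetic":      (0.40, 0.30),
        "filing":          (0.30, 0.10),
        "public contact":  (0.20, 0.00),  # not measured by this test
    }

    # Synthetic validity: multiply each element's correlation with
    # the job by its weight in the test, and sum across elements.
    synthetic_validity = sum(r_job * w_test
                             for r_job, w_test in job_elements.values())

    print(f"Synthetic validity estimate: {synthetic_validity:.2f}")

With these illustrative figures the estimate is .45; elements that the job demands but the test does not measure contribute nothing, which is why a battery is ordinarily "synthesized" from several tests covering different elements.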

CONSTRUCT VALIDITY. The construct validity of a test is the extent to which the test may be said to measure a theoretical construct or trait. Examples of such constructs are intelligence, mechanical comprehension, verbal fluency, speed of walking, neuroticism, and anxiety. Focusing on a broader, more enduring, and more abstract kind of behavioral description than the previously discussed types of validity, construct validation requires the gradual accumulation of information from a variety of sources. Any data throwing light on the nature of the trait under consideration and the conditions affecting its development and manifestations are grist for this validity mill. Illustrations of specific techniques suitable for construct validation will be considered below.

DEVELOPMENTAL CHANGES. A major criterion employed in the validation of a number of intelligence tests is age differentiation. The Stanford-Binet and most preschool tests are checked against chronological age to determine whether the scores show a progressive increase with advancing age. Since abilities are expected to increase with age during childhood, it is argued that the test scores should likewise show such an increase, if the test is valid. The very concept of an age scale of intelligence, as initiated by Binet, is based on the assumption that "intelligence" increases with age, at least until maturity.

The criterion of age differentiation, of course, is inapplicable to any functions that do not exhibit clear-cut and consistent age changes; in the area of personality measurement, for example, it has found limited use. A further point should be emphasized regarding the interpretation of the age criterion. A psychological test validated against such a criterion measures behavior characteristics that increase with age under the conditions existing in the type of environment in which the test was standardized. Because different cultures may stimulate and foster the development of dissimilar behavior characteristics, it cannot be assumed that the criterion of age differentiation is a universal one; it is circumscribed by the particular cultural setting in which it is derived.

Moreover, age differentiation is a necessary but not a sufficient condition for validity. Thus, if test scores fail to improve with age, such a finding probably indicates that the test is not a valid measure of the abilities it was designed to sample. On the other hand, to prove that a test measures something that increases with age does not define the area covered by the test very precisely; regular age increments would not in themselves insure validity. A measure of height or weight would also show regular age increments, although it would obviously not be designated as an intelligence test.

Developmental analyses are also basic to the construct validation of the Piagetian ordinal scales cited in Chapter 4. A fundamental assumption of such scales is the sequential patterning of development, such that the attainment of earlier stages in concept development is prerequisite to the acquisition of later conceptual skills. There is thus an intrinsic hierarchy in the content of these scales. The construct validation of ordinal scales should therefore include empirical data on the sequential invariance of the successive steps. This involves checking the performance of children at different levels in the development of any tested concept, such as conservation or object permanence. Do children who demonstrate mastery of the concept at a given level also exhibit mastery at the lower levels? Insofar as criterion-referenced tests are also frequently designed according to a hierarchical pattern of learned skills, they too can utilize empirical evidence of hierarchical invariance in their validation.

CORRELATIONS WITH OTHER TESTS. Correlations between a new test and similar earlier tests are sometimes cited as evidence that the new test measures approximately the same general area of behavior as other tests designated by the same name, such as "intelligence tests" or "mechanical aptitude tests." Unlike the correlations found in criterion-related validity, these correlations should be moderately high, but not too high. If the new test correlates too highly with an already available test, without such added advantages as brevity or ease of administration, then the new test represents needless duplication.

Correlations with other tests are employed in still another way, to demonstrate that the new test is relatively free from the influence of certain irrelevant factors. For example, a special aptitude test or a personality test should have a negligible correlation with tests of general intelligence or scholastic aptitude; similarly, reading comprehension should not appreciably affect performance on such tests. Thus, low correlations with tests of general intelligence, reading, or verbal comprehension are sometimes reported as indirect or negative evidence of validity. It will be noted that this use of correlations with other tests is similar to one of the supplementary techniques described under content validity.

FACTOR ANALYSIS. Of particular relevance to construct validity is factor analysis, a statistical procedure for the identification of psychological traits. Essentially, factor analysis is a refined technique for analyzing the interrelationships of behavior data. For example, if 20 tests have been given to 300 persons, the first step is to compute the correlations of each test with every other. An inspection of the resulting table of 190 correlations may itself reveal certain clusters among the tests, suggesting the location of common traits. Thus, if such tests as vocabulary, analogies, opposites, and sentence completion have high correlations with each other and low correlations with all other tests, we could tentatively infer the presence of a verbal comprehension factor. Because such an inspectional analysis of a correlation table is difficult and uncertain, however, more precise statistical techniques have been developed to locate the common factors required to account for the obtained correlations. These techniques of factor analysis will be examined further in Chapter 13.
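The first computational step, and a crude version of the factor extraction itself, can be sketched in a few lines. The example below assumes hypothetical scores on six tests and extracts factors from their correlation matrix by a simple principal-component method; this is only one of several extraction techniques and is used here purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 300  # hypothetical examinees

    # Simulate six tests: the first three share a "verbal" factor,
    # the last three a "numerical" factor (entirely hypothetical).
    verbal = rng.normal(size=n)
    number = rng.normal(size=n)
    tests = np.column_stack([
        verbal + 0.6 * rng.normal(size=n),   # vocabulary
        verbal + 0.6 * rng.normal(size=n),   # analogies
        verbal + 0.6 * rng.normal(size=n),   # opposites
        number + 0.6 * rng.normal(size=n),   # arithmetic
        number + 0.6 * rng.normal(size=n),   # number series
        number + 0.6 * rng.normal(size=n),   # computation
    ])

    # Step 1: correlation of each test with every other.
    R = np.corrcoef(tests, rowvar=False)

    # Step 2: extract factors (principal components of R).  The
    # loading of each test on a factor is its correlation with that
    # factor, i.e., the test's factorial validity for that trait.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]        # largest factors first
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])

    print("Loadings on the first two factors:")
    print(np.round(loadings[:, :2], 2))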

In the process of factor analysis, the number of variables or categories in terms of which each individual's performance can be described is reduced from the number of original tests to a relatively small number of factors, or common traits. A major purpose of factor analysis is thus to simplify the description of behavior by reducing an initial multiplicity of test variables to a few common traits. In the example cited above, five or six factors might suffice to account for the intercorrelations among the 20 tests. Each individual might then be described in terms of his scores on the five or six factors, rather than in terms of the original 20 scores.

After the factors have been identified, they can be utilized in describing the factorial composition of a test. Each test can thus be characterized in terms of the major factors determining its scores, together with the weight or loading of each factor and the correlation of the test with each factor. Such a correlation is known as the factorial validity of the test. Thus, if the verbal comprehension factor has a weight of .66 in a vocabulary test, the factorial validity of this vocabulary test as a measure of the trait of verbal comprehension is .66. It should be noted that factorial validity is essentially the correlation of the test with whatever is common to a group of tests or other indices of behavior. The set of variables analyzed can, of course, include both test and nontest data; ratings and other criterion measures can thus be utilized, along with other tests, to explore the factorial validity of a particular test and to define the common traits it measures.

INTERNAL CONSISTENCY. In the published descriptions of certain tests, especially in the area of personality, the statement is made that the test has been validated by the method of internal consistency. The essential characteristic of this method is that the criterion is none other than the total score on the test itself. Sometimes an adaptation of the contrasted group method is used, extreme groups being selected on the basis of the total test score. The performance of the upper criterion group on each test item is then compared with that of the lower criterion group. Items that fail to show a significantly greater proportion of "passes" in the upper than in the lower criterion group are considered invalid, and are either eliminated or revised. Correlational procedures may also be employed for this purpose. For example, the biserial correlation between "pass-fail" on each item and total test score can be computed, and only those items yielding significant item-test correlations would be retained. A test whose items were selected by either method can be said to show internal consistency, since each item differentiates in the same direction as the entire test.

Another application of the criterion of internal consistency involves the correlation of subtest scores with total score. Many intelligence tests, for instance, consist of separately administered subtests (such as vocabulary, arithmetic, picture completion, etc.) whose scores are combined in finding the total test score. In the construction of such tests, the scores on each subtest are often correlated with total score, and any subtest whose correlation with total score is too low is eliminated. The correlations of the remaining subtests with total score are then reported as evidence of the internal consistency of the entire instrument.

It is apparent that internal consistency correlations, whether based on items or subtests, are essentially measures of homogeneity. Because it helps to characterize the behavior domain or trait sampled by the test, the degree of homogeneity of a test has some relevance to its construct validity. Nevertheless, the contribution of internal consistency data to test validation is very limited. In the absence of data external to the test itself, little can be learned about what a test measures.
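The item-screening procedure described above reduces to a simple computation. The following sketch uses a small matrix of hypothetical pass-fail responses and computes a point-biserial item-total correlation for each item, a common stand-in for the biserial coefficient mentioned in the text; the data and the .30 retention cutoff are illustrative assumptions only.

    import numpy as np

    # Hypothetical responses of 8 examinees to 5 items (1 = pass).
    responses = np.array([
        [1, 1, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 0, 1, 0, 1],
        [1, 1, 1, 0, 1],
        [0, 0, 0, 0, 1],
        [1, 1, 1, 1, 0],
        [0, 1, 0, 0, 0],
    ])

    total = responses.sum(axis=1)

    for item in range(responses.shape[1]):
        # Correlate pass-fail on this item with total score.  (A
        # stricter procedure would exclude the item from the total.)
        r = np.corrcoef(responses[:, item], total)[0, 1]
        verdict = "retain" if r >= 0.30 else "eliminate or revise"
        print(f"Item {item + 1}: item-total r = {r:5.2f}  ({verdict})")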
EFFECT OF EXPERIMENTAL VARIABLES ON TEST SCORES. A further source of data for construct validation is provided by experiments on the effect of selected variables on test scores. In checking the validity of a criterion-referenced test for use in an individualized instructional program, for example, one approach is through a comparison of pretest and posttest scores. The rationale of such a test calls for low scores on the pretest, administered before the relevant instruction, and high scores on the posttest. This relationship can also be checked for individual items in the test (Popham, 1971). Ideally, the largest proportion of examinees should fail an item on the pretest and pass it on the posttest. Items that are commonly failed on both tests are too difficult, and those passed on both tests too easy, for the purposes of such a test. If a sizeable proportion of examinees pass an item on the pretest and fail it on the posttest, there is obviously something wrong with the item, or the instruction, or both.

A test designed to measure anxiety-proneness can be administered to subjects who are subsequently put through a situation designed to arouse anxiety, such as taking an examination under distracting and stressful conditions. The initial anxiety test scores can then be correlated with physiological and other indices of anxiety expression during and after the examination. A different hypothesis regarding an anxiety test could be evaluated by administering the test before and after an anxiety-arousing experience and seeing whether test scores rise significantly on the retest. Positive findings from such an experiment would indicate that the test scores reflect current anxiety level. In a similar way, experiments can be designed to test any other hypothesis regarding the trait measured by a given test.
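The pretest-posttest item check described at the beginning of this section is likewise a matter of simple tabulation. The sketch below uses small hand-made matrices of hypothetical pass-fail records; with these invented data, the first item behaves as intended, the second is too easy, and the third too difficult.

    import numpy as np

    # Hypothetical pass (1) / fail (0) records for five students on
    # three items, before and after the relevant instruction.
    pretest = np.array([[0, 1, 0],
                        [0, 1, 0],
                        [0, 0, 0],
                        [1, 1, 0],
                        [0, 1, 0]])
    posttest = np.array([[1, 1, 0],
                         [1, 1, 0],
                         [1, 0, 1],
                         [1, 1, 0],
                         [0, 1, 0]])

    for i in range(pretest.shape[1]):
        fail_pass = np.mean((pretest[:, i] == 0) & (posttest[:, i] == 1))
        pass_fail = np.mean((pretest[:, i] == 1) & (posttest[:, i] == 0))
        pass_both = np.mean((pretest[:, i] == 1) & (posttest[:, i] == 1))
        print(f"Item {i + 1}: fail-then-pass {fail_pass:.0%}, "
              f"pass-then-fail {pass_fail:.0%}, pass-both {pass_both:.0%}")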

CONVERGENT AND DISCRIMINANT VALIDATION. In a thoughtful analysis of construct validation, Campbell (1960) points out that in order to demonstrate construct validity we must show not only that a test correlates highly with other variables with which it should theoretically correlate, but also that it does not correlate significantly with variables from which it should differ. In an earlier article, Campbell and Fiske (1959) described the former process as convergent validation and the latter as discriminant validation. Correlation of a mechanical aptitude test with subsequent grades in a shop course would be an example of convergent validation. For the same test, discriminant validity would be demonstrated by a low and insignificant correlation with scores on a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure mechanical aptitude.

It will be recalled that the requirement of low correlation with irrelevant variables was discussed in connection with supplementary and precautionary procedures followed in content validation. Discriminant validation is also especially relevant to the validation of personality tests, in which irrelevant variables may affect scores in a variety of ways.

Campbell and Fiske (1959) proposed a systematic experimental design for the dual approach of convergent and discriminant validation, which they called the multitrait-multimethod matrix. Essentially, this procedure requires the assessment of two or more traits by two or more methods. A hypothetical example provided by Campbell and Fiske will serve to illustrate the procedure. Table 12 shows all possible correlations among the scores obtained when three traits are each measured by three methods. The three traits could represent three personality characteristics, such as (A) dominance, (B) sociability, and (C) achievement motivation. The three methods could be (1) a self-report inventory, (2) a projective technique, and (3) associates' ratings. Thus, A1 would indicate dominance scores on the self-report inventory, A2 dominance scores on the projective test, C3 associates' ratings on achievement motivation, and so forth.

TABLE 12. A Hypothetical Multitrait-Multimethod Matrix. (From Campbell & Fiske, 1959.) The full matrix of correlations is not reproduced here. Note: Letters A, B, C refer to traits; subscripts 1, 2, 3 to methods. Validity coefficients (monotrait-heteromethod) are the three diagonal sets of boldface numbers; reliability coefficients (monotrait-monomethod) are the numbers in parentheses along the principal diagonal. Solid triangles enclose heterotrait-monomethod correlations; broken triangles enclose heterotrait-heteromethod correlations.

The hypothetical correlations given in Table 12 include reliability coefficients (in parentheses, along the principal diagonal) and validity coefficients (in boldface, along three shorter diagonals). In these validity coefficients, the scores obtained for the same trait by different methods are correlated; each measure is thus being checked against other, independent measures of the same trait. The table also includes correlations between different traits measured by the same method (in solid triangles) and correlations between different traits measured by different methods (in broken triangles). For satisfactory construct validity, the validity coefficients should obviously be higher than the correlations between different traits measured by different methods; they should also be higher than the correlations between different traits measured by the same method. For example, the correlation between dominance scores from a self-report inventory and dominance scores from a projective test should be higher than the correlation between dominance and sociability scores, both derived from a self-report inventory. If the latter correlation, representing common method variance, were high, it might indicate, for example, that a person's scores on this inventory are unduly affected by some irrelevant common factor, such as ability to understand the questions or desire to make oneself appear in a favorable light on all traits.
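The comparisons required by the multitrait-multimethod design are mechanical enough to express in a few lines. The sketch below assumes a hypothetical full correlation matrix for three traits, each measured by two methods; the values are invented for illustration and are not the Campbell-Fiske figures. For each trait, the validity coefficient is checked against every heterotrait correlation involving that trait, whether same-method or different-method.

    import numpy as np

    traits, methods = ["A", "B", "C"], ["1", "2"]
    labels = [t + m for m in methods for t in traits]  # A1 B1 C1 A2 B2 C2

    # Hypothetical symmetric correlation matrix among the six measures.
    R = np.array([
        #  A1    B1    C1    A2    B2    C2
        [1.00, 0.30, 0.25, 0.55, 0.20, 0.15],  # A1
        [0.30, 1.00, 0.28, 0.22, 0.50, 0.18],  # B1
        [0.25, 0.28, 1.00, 0.16, 0.21, 0.45],  # C1
        [0.55, 0.22, 0.16, 1.00, 0.33, 0.27],  # A2
        [0.20, 0.50, 0.21, 0.33, 1.00, 0.29],  # B2
        [0.15, 0.18, 0.45, 0.27, 0.29, 1.00],  # C2
    ])
    idx = {lab: i for i, lab in enumerate(labels)}

    for t in traits:
        validity = R[idx[t + "1"], idx[t + "2"]]
        # All correlations of this trait's measures with OTHER traits.
        others = [R[idx[t + m1], idx[u + m2]]
                  for u in traits if u != t
                  for m1 in methods for m2 in methods]
        ok = validity > max(others)
        print(f"Trait {t}: validity {validity:.2f}, "
              f"highest heterotrait r {max(others):.2f} -> "
              f"{'satisfactory' if ok else 'questionable'}")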

Fiske (1973) has added still another set of correlations that should be checked: those between different tests designed to measure the same trait by the same method. For example, two investigators may each prepare a self-report inventory designed to assess endurance. Yet the endurance scores obtained with the two inventories may show quite different patterns of correlations with measures of other personality traits. Under these conditions, it cannot be concluded that both inventories measure the same personality construct of endurance.

It might be noted that, within the framework of the multitrait-multimethod matrix, reliability represents agreement between two measures of the same trait obtained through maximally similar methods, such as parallel forms of the same test; validity represents agreement between measures of the same trait obtained by maximally different methods, such as test scores and supervisor's ratings. Since similarity and difference of methods are matters of degree, reliability and validity can theoretically be regarded as falling along a single continuum. Ordinarily, however, the techniques actually employed to measure reliability and validity correspond to easily identifiable regions of this continuum.

COMPARISON OF VALIDATION PROCEDURES. We have considered several ways of asking, "How valid is this test?" To point up the distinctive features of the different types of validity, let us apply each in turn to a test consisting of 50 assorted arithmetic problems. Four ways in which this test might be employed, together with the type of validation procedure appropriate to each, are illustrated in Table 13.

TABLE 13
Validation of a Single Arithmetic Test for Different Purposes

Testing Purpose                                    Illustrative Question                                    Type of Validity
Achievement test in elementary school arithmetic   How much has Dick learned in the past?                   Content
Aptitude test to predict performance in high       How well will Jim learn in the future?                   Criterion-related: predictive
  school mathematics
Technique for diagnosing learning disabilities     Does Bill's performance show specific disabilities?      Criterion-related: concurrent
Measure of logical reasoning                       How can we describe Henry's psychological functioning?   Construct

This example highlights the fact that the choice of validation procedure depends on the use to be made of the test scores. The same test, when employed for different purposes, should be validated in different ways. If an achievement test is used to predict subsequent performance at a higher educational level, as when selecting high school students for college admission, it needs to be evaluated against the criterion of subsequent college performance rather than in terms of its content validity.

The examples given in Table 13 focus on the differences among the various types of validation procedures. Further consideration of these procedures shows, however, that content, criterion-related, and construct validity do not correspond to distinct or logically coordinate categories. On the contrary, construct validity is a comprehensive concept, which includes the other types. All the specific techniques for establishing content and criterion-related validity, discussed in earlier sections of this chapter, could have been listed again under construct validity. Comparing the test performance of contrasted groups, such as neurotics and normals, is one way of checking the construct validity of a test designed to measure emotional adjustment, anxiety, or other postulated traits. Comparing the test scores of institutionalized mental retardates with those of normal schoolchildren is one way to investigate the construct validity of an intelligence test. The correlations of a mechanical aptitude test with performance in shop courses and in a wide variety of jobs contribute to our understanding of the construct measured by the test. Validity against various practical criteria is commonly reported in test manuals to aid the potential user in understanding what a test measures. Although he may not be directly concerned with the prediction of any of the specific criteria employed, by examining such criteria the test user is able to build up a concept of the behavior domain sampled by the test.

Content validity likewise enters into both the construction and the subsequent evaluation of all tests. In assembling items for any new test, the test constructor is guided by hypotheses regarding the relations between the type of content he chooses and the behavior he wishes to measure. All the techniques of criterion-related validation, as well as the other techniques discussed under construct validation, represent ways of testing such hypotheses. As for the test user, he too relies in part on content validity in evaluating any test. For example, he may check the vocabulary in an emotional adjustment inventory to determine whether some of the words are too difficult for the persons he plans to test; he may notice that an intelligence test developed twenty years ago contains many obsolescent items unsuitable for use today; or he may conclude that the scores on a particular test depend too much on speed for his purposes. All these observations about content are relevant to the construct validity of a test. In fact, there is no information provided by any validation procedure that is not relevant to construct validity.

The term construct validity was officially introduced into the psychometrist's lexicon in 1954 in the Technical Recommendations for Psychological Tests and Diagnostic Techniques, which constituted the first edition of the current APA test Standards (1974).

Although the validation procedures subsumed under construct validity were not new at the time, the discussions of construct validation that followed served to make the implications of these procedures more explicit and to provide a systematic rationale for their use. Construct validation has focused attention on the role of psychological theory in test construction and on the need to formulate hypotheses that can be proved or disproved in the validation process. It is particularly appropriate in the evaluation of tests for use in research. Construct validation has also stimulated the search for novel ways of gathering validity data. Although the principal techniques employed in investigating construct validity have long been familiar, the field of operation has been expanded to admit a wider variety of procedures.

This very multiplicity of data-gathering techniques, however, presents certain hazards. It is possible for a test constructor to try a large number of different validation procedures, a few of which will yield positive results by chance. If these confirmatory results were then to be reported without mention of all the validity probes that yielded negative results, a very misleading impression about the validity of a test could be created. Another possible danger in the application of construct validation is that it may open the way for subjective, unverified assertions about test validity. Since construct validity is such a broad and loosely defined concept, some textbook writers and test constructors seem to perceive it as content validity expressed in terms of psychological trait names. Hence, they present as construct validity purely subjective accounts of what they believe (or hope) the test measures.

A further source of possible confusion arises from a statement that has frequently been offered as if it were a definition of construct validity: namely, that construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not "operationally defined" (Cronbach & Meehl, 1955, p. 282). Appearing in the first detailed published analysis of the concept of construct validity, this statement has often been incorrectly accepted as justifying claims of validity in the absence of data. That its authors did not intend such an interpretation is shown by their own insistence, in the same article, that "unless the network makes contact with observations" construct validation cannot be claimed (p. 291); in the same connection, they criticize tests for which "a finespun network of rationalizations has been offered as if it were validation" (p. 291). It is only through the empirical investigation of the relationships of test scores to other external data that we can discover what a test measures.

In practical contexts, construct validation can also contribute to the evaluation of tests for specific uses. Like synthetic validation, this approach requires a systematic job analysis, followed by a description of worker qualifications expressed in terms of relevant behavioral constructs. If the test has been subjected to sufficient research prior to publication, the data cited in the manual should permit a specification of the principal constructs measured by the test. Then, if the correspondence of constructs is clear enough, this information could be used directly in assessing the relevance of the test to the required job functions; or it could serve as a basis for computing a J-coefficient or some other quantitative index of synthetic validity.

Another practical application of construct validation is in the evaluation of tests in situations that do not permit acceptable criterion-related validation studies, as in the local validation of some personnel tests for industrial use. The difficulties encountered in these situations were discussed earlier in this chapter, in connection with synthetic validity. Construct validation offers another alternative approach that could be followed in evaluating the appropriateness of published tests for a particular job.

Finally, construct validation is suitable for investigating the validity of the criterion measures used in traditional criterion-related test validation (see, e.g., James, 1973). Through an analysis of the correlations of different criterion measures with each other and with other relevant variables, and through factorial analyses of such data, one can learn more about the meaning of a particular criterion. In some instances, the results of such a study may lead to modification or replacement of the criterion chosen to validate a test. Under any circumstances, the results will enrich the interpretation of the test validation study.

CHAPTER 7

Validity: Measurement and Interpretation

CHAPTER 6 was concerned with different concepts of validity and their appropriateness for various testing functions; this chapter deals with quantitative expressions of validity and their interpretation. The test user is concerned with validity at either or both of two stages. First, when considering the suitability of a test for his purposes, he examines available validity data reported in the test manual or other published sources. Through such information, he arrives at a tentative concept of what psychological functions the test actually measures, and he judges the relevance of such functions to his proposed use of the test. In effect, when a test user relies on published validation data, he is dealing with construct validity, regardless of the specific procedures used in gathering the data.

Because of the specificity of each criterion, test users are usually advised to check the validity of any chosen test against local criteria whenever possible. Jobs bearing the same title in two different companies are rarely identical, and two courses in freshman English taught in different colleges may be quite dissimilar. Although published data may strongly suggest that a given test should have high validity in a particular situation, direct corroboration is always desirable. The determination of validity against specific local criteria represents the second stage in the test user's evaluation of validity. The techniques to be discussed in this chapter are especially relevant to the analysis of validity data obtained by the test user himself. Most of them are also useful, however, in understanding and interpreting the validity data reported in test manuals.

MEASUREMENT OF RELATIONSHIP. A validity coefficient is a correlation between test score and criterion measure. Because it provides a single numerical index of test validity, it is commonly used in test manuals to report the validity of a test against each criterion for which data are available. The data used in computing any validity coefficient can also be expressed in the form of an expectancy table or expectancy chart, illustrated in Chapter 4. It will be recalled that expectancy charts give the probability that an individual who obtains a certain score on the test will attain a specified level of criterion performance. For example, if we know a student's score on the DAT Verbal Reasoning test, we can look up the chances that he will earn a particular grade in a high school course (see Table 6, Ch. 4, p. 101). The same data yield a validity coefficient of .66. Such tables and charts thus provide a convenient way to show what a validity coefficient means for the person tested.

When both test and criterion variables are continuous, as in this example, the familiar Pearson Product-Moment Correlation Coefficient is applicable. Other types of correlation coefficients can be computed when the data are expressed in different forms, as when a two-fold pass-fail criterion is employed. The specific procedures for computing these different kinds of correlations can be found in any standard statistics text.

CONDITIONS AFFECTING VALIDITY COEFFICIENTS. As in the case of reliability, it is essential to specify the nature of the group on which a validity coefficient is found. The same test may measure different functions when given to individuals who differ in age, sex, educational level, occupation, or any other relevant characteristic. Persons with different experiential backgrounds, for example, may utilize different work methods to solve the same test problem. Consequently, a test could have high validity in predicting a particular criterion in one population, and little or no validity in another; or it might be a valid measure of different functions in the two populations. Thus, unless the validation sample is representative of the population on which the test is to be used, validity should be redetermined on a more appropriate sample.

The question of sample heterogeneity is relevant to the measurement of validity, as it is to the measurement of reliability, since both characteristics are commonly reported in terms of correlation coefficients. It will be recalled that, other things being equal, the wider the range of scores, the higher will be the correlation. This fact should be kept in mind when interpreting the validity coefficients given in test manuals.

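The effect of range on a correlation coefficient can be demonstrated with a small simulation. The sketch below, using hypothetical applicant data with a population validity of about .60, recomputes the coefficient after the score range has been curtailed at the lower end; the figures are illustrative only.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 1000

    # Hypothetical applicant pool; true test-criterion correlation ~ .60.
    test = rng.normal(size=n)
    criterion = 0.6 * test + 0.8 * rng.normal(size=n)

    r_full = np.corrcoef(test, criterion)[0, 1]

    # Restrict the range: keep only applicants above the median test
    # score, as preselection or rising admission standards would do.
    kept = test > np.median(test)
    r_restricted = np.corrcoef(test[kept], criterion[kept])[0, 1]

    print(f"Validity in the full range      : {r_full:.2f}")
    print(f"Validity in the restricted range: {r_restricted:.2f}")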
A special difficulty encountered in many validation samples arises from preselection. For example, a new test that is being validated for job selection may be administered to a group of newly hired employees on whom criterion measures of job performance will eventually be available. It is likely, however, that such employees represent a superior selection of all those who applied for the job. Hence, the range of such a group in both test scores and criterion measures will be curtailed at the lower end of the distribution, and the effect of such preselection will be to lower the validity coefficient. In the subsequent use of the test, when it is administered to all applicants for selection purposes, the validity can be expected to be somewhat higher.

Validity coefficients may also change over time because of changing selection standards. An example is provided by a comparison of validity coefficients computed over a 30-year interval with Yale students (Burnham, 1965). Correlations were found between a predictive index based on College Entrance Examination Board tests and high school records, on the one hand, and average freshman grades, on the other. This correlation dropped to .52 over the 30 years. An examination of the bivariate distributions clearly reveals the reason for this drop. Because of higher admission standards, the later class was more homogeneous than the earlier class in both predictor and criterion performance. Consequently, the correlation was lower in the later group, although the accuracy with which individuals' grades were predicted showed little change. In other words, the observed drop in correlation did not indicate that the predictors were less valid than they had been 30 years earlier. Had the differences in group homogeneity been ignored, it might have been wrongly concluded that this was the case.

For the proper interpretation of a validity coefficient, attention should also be given to the form of the relationship between test and criterion. The computation of a Pearson correlation coefficient assumes that the relationship is linear and uniform throughout the range. There is evidence that in certain situations these conditions may not be met (Fisher, 1959; Kahneman & Ghiselli, 1962). Thus, a particular job may require a minimum level of reading comprehension, to enable employees to read instruction manuals, labels, and the like. Once this minimum is exceeded, however, further increments in reading ability may be unrelated to degree of job success. An examination of the bivariate distribution or scatter diagram obtained by plotting reading comprehension scores against criterion measures would show a rise in job performance up to the minimal required reading ability and a leveling off beyond that point; the entries would cluster around a curve rather than a straight line. This would be an example of a nonlinear relation between test and job performance.

In other situations, the line of best fit may be a straight line, but the individual entries may scatter farther around this line at the upper than at the lower end of the scale. Suppose that performance on a scholastic aptitude test is a necessary but not a sufficient condition for successful achievement in a course: the low-scoring students will perform poorly in the course, but among the high-scoring students some will perform well while others will perform poorly because of low motivation. In this situation there will be wider variability of criterion performance among the high-scoring than among the low-scoring students. This condition in a bivariate distribution is known as heteroscedasticity. The Pearson correlation assumes homoscedasticity, or equal variability throughout the range of the bivariate distribution. In the present example, the bivariate distribution would be fan-shaped, wide at the upper end and narrow at the lower end. An examination of the bivariate distribution itself will usually give a good indication of the nature of the relationship between test and criterion. Expectancy tables and expectancy charts also correctly reveal the relative effectiveness of the test at different levels.

MAGNITUDE OF A VALIDITY COEFFICIENT. How high should a validity coefficient be? No general answer to this question is possible, since the interpretation of a validity coefficient must take into account a number of concomitant circumstances. The obtained correlation, of course, should be high enough to be statistically significant at some acceptable level, such as the .01 or .05 level discussed in Chapter 5. In other words, before drawing any conclusions about the validity of a test, we should be reasonably certain that the obtained validity coefficient could not have arisen through chance fluctuations of sampling from a true correlation of zero.

Having established a significant correlation between test scores and criterion, however, we need to evaluate the size of the correlation in the light of the uses to be made of the test. If we wish to predict an individual's exact criterion score, such as the grade-point average a student will receive in college, the validity coefficient may be interpreted in terms of the standard error of estimate, which is analogous to the error of measurement discussed in connection with reliability. It will be recalled that the error of measurement indicates the margin of error to be expected in an individual's score as a result of the unreliability of the test. Similarly, the error of estimate shows the margin of error to be expected in the individual's predicted criterion score, as a result of the imperfect validity of the test. The error of estimate is found by the following formula:

σ_est = SD_y √(1 − r²_xy)

in which r²_xy is the square of the validity coefficient and SD_y is the standard deviation of the criterion scores.
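In computational form, the error of estimate is a one-line function of the validity coefficient and the criterion SD. A minimal sketch follows; the validity values are chosen to match those discussed in the next paragraphs.

```python
import math

def error_of_estimate(validity, sd_criterion):
    """Standard error of estimate: SD_y * sqrt(1 - r^2)."""
    return sd_criterion * math.sqrt(1.0 - validity ** 2)

# The factor sqrt(1 - r^2) alone gives the size of the error relative
# to the error of a chance guess (i.e., relative to zero validity).
for r in (0.0, 0.50, 0.80, 1.00):
    rel = math.sqrt(1.0 - r ** 2)
    print(f"validity {r:.2f}: error is {rel:.0%} as large as a chance guess")
```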

It will be noted that if the validity were perfect (r_xy = 1.00), the error of estimate would be zero. On the other hand, with a test having zero validity, the error of estimate is as large as the standard deviation of the criterion distribution (σ_est = SD_y √(1 − 0) = SD_y). Under these conditions, the prediction is no better than a guess, and the range of prediction error is as wide as the entire distribution of criterion scores. Between these two extremes are to be found the errors of estimate corresponding to tests of varying validity.

Reference to the formula for σ_est will show that the term √(1 − r²_xy) serves to indicate the size of the error relative to the error that would result from a mere guess, i.e., with zero validity. In other words, if √(1 − r²_xy) is equal to 1.00, the error of estimate is as large as it would be if we were to guess the subject's score, and the predictive improvement attributable to the use of the test would be nil. If the validity coefficient is .80, then √(1 − r²_xy) is equal to .60, and the error is 60 percent as large as it would be by chance. To put it differently, the use of such a test enables us to predict the individual's criterion performance with a margin of error that is 40 percent smaller than it would be if we were to guess. It would thus appear that even with a validity of .80, which is unusually high, the error of predicted scores is considerable. If the primary function of psychological tests were to predict each individual's exact position in the criterion distribution, the outlook would be quite discouraging.

BASIC APPROACH. When examined in the light of the error of estimate, most tests do not appear very efficient. In most testing situations, however, it is not necessary to predict the specific criterion performance of individual cases, but rather to determine which individuals will exceed a certain minimum standard of performance, or cutoff point, in the criterion. What are the chances that Mary Greene will graduate from medical school, that Tom Higgins will pass a course in calculus, or that Beverly Bruce will succeed as an astronaut? Which applicants are likely to be satisfactory clerks, salesmen, or machine operators? Such information is useful not only for group selection but also for individual career planning. For example, it is advantageous for a student to know that he has a good chance of passing all courses in law school, even if we are unable to estimate with certainty whether his grade average will be 74 or 81.

A test may appreciably improve predictive efficiency if it shows any significant correlation with the criterion, however low. Under certain circumstances, even validities as low as .20 or .30 may justify inclusion of the test in a selection program. For many testing purposes, evaluation of tests in terms of the error of estimate is unrealistically stringent. Consideration must be given to other ways of evaluating the contribution of a test, which take into account the types of decisions to be made from the scores. Some of these procedures will be illustrated in the following section.

Let us suppose that 100 applicants have been given an aptitude test and followed up until each could be evaluated for success on a certain job. Figure 17 shows the bivariate distribution of test scores and measures of job success for the 100 subjects. The correlation between these two variables is slightly below .70. The minimum acceptable job performance, or criterion cutoff point, is indicated in the diagram by a heavy horizontal line. The 40 cases falling below this line would represent job failures; the 60 above the line, job successes. If all 100 applicants are hired, therefore, 60 percent will succeed on the job. Similarly, if a smaller number were hired at random, without reference to test scores, the proportion of successes would probably be close to 60 percent. Suppose, however, that the test scores are used to select the 45 most promising applicants out of the 100 (selection ratio = .45). In such a case, the 45 individuals falling to the right of the heavy vertical line would be chosen. Within this group of 45 there are 7 job failures and 38 job successes. Hence, the percentage of job successes is now 84 rather than 60 (i.e., 38/45 = .84). This increase is attributable to the use of the test as a screening instrument. It will be noted that errors in predicted criterion score that do not affect the decision can be ignored; only those prediction errors that cross the cutoff line, and hence place the individual in the wrong category, will reduce the selective effectiveness of the test.

For a complete evaluation of the effectiveness of the test as a screening instrument, another category of cases in Figure 17 must also be examined. This is the category of false rejects, comprising the 22 persons who score below the cutoff point on the test but above the criterion cutoff. From these data we would estimate that 22 percent of the total applicant sample are potential job successes who will be lost if the test is used as a screening device with the present cutoff point. These false rejects in a personnel selection situation correspond to the false positives in clinical evaluations. The latter term has been adopted from medical practice, in which a test for a pathological condition is reported as positive if the condition is present and negative if the patient is normal. A false positive thus refers to a case in which the test erroneously indicates the presence of a pathological condition, as when brain damage is indicated in an individual who is actually normal. This terminology is likely to be confusing unless we remember that in clinical practice a positive result on a test denotes pathology and unfavorable diagnosis, whereas in personnel selection a positive result conventionally refers to a favorable prediction regarding job performance.
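The arithmetic of the 100-applicant example can be checked directly from the four cell counts of Figure 17 given in the text; nothing else in the sketch is assumed.

```python
# The 100-applicant example of Figure 17, recomputed from its cell counts.
valid_accept, false_accept = 38, 7      # scored above the test cutoff
false_reject, valid_reject = 22, 33     # scored below the test cutoff

hired = valid_accept + false_accept                  # 45 (selection ratio .45)
base_rate = (valid_accept + false_reject) / 100      # successes if all were hired
hit_rate = valid_accept / hired                      # successes among those hired

print(f"base rate without the test: {base_rate:.0%}")        # 60%
print(f"success rate among the 45 hired: {hit_rate:.0%}")    # 84%
print(f"false rejects lost by screening: {false_reject} of 100")
```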

Statistical decision theory was developed by Wald (1950) with special reference to the decisions required in the inspection and quality control of industrial products. Many of its implications for the construction and interpretation of psychological tests have been systematically worked out by Cronbach and Gleser (1965). Essentially, decision theory is an attempt to put the decision-making process into mathematical form, so that available information may be used to arrive at the most effective decision under specified circumstances. The mathematical procedures employed in decision theory are often quite complex, and few are in a form permitting their immediate application to practical testing problems. Some of the basic concepts of decision theory, however, are proving helpful in the reformulation and clarification of certain questions about tests. A few of these ideas were introduced into testing before the formal development of statistical decision theory and were later recognized as fitting into that framework.

In the terminology of decision theory, the example given in Figure 17 illustrates a simple strategy, or plan for deciding which applicants to accept and which to reject. In this case, the strategy was to accept the 45 persons with the highest test scores. A strategy is a technique for utilizing information in order to reach a decision about individuals. The increase in the percentage of successful employees from 60 to 84 could be used as a basis for estimating the net benefit resulting from the use of the test.

FIG. 17. Increase in the Proportion of "Successes" Resulting from the Use of a Selection Test. [Scatter diagram of test score against job performance, with a heavy horizontal line at the criterion cutoff separating job successes from job failures, and a heavy vertical line at the test cutoff score.]

In many personnel decisions, the selection ratio is determined by the practical demands of the situation. Because of supply and demand in filling job openings, it may be necessary to hire the top 40 percent of applicants in one case and the top 75 percent in another. When the selection ratio is not externally imposed, the cutting score on a test can be set at that point giving the maximum differentiation between criterion groups. This can be done roughly by comparing the distribution of test scores in the two criterion groups. More precise mathematical procedures for setting optimal cutting scores have also been worked out (Darlington & Stauffer, 1966; Guttman & Raju, 1965; Rorer, Hoffman, La Forge, & Hsieh, 1966). These procedures make it possible to take into account other relevant parameters, such as the relative seriousness of false rejections and false acceptances.

In some situations the cutoff point should be set sufficiently high to exclude all but a few possible failures. This would be the case when the job is of such a nature that a poorly qualified worker could cause serious loss or damage; an example would be a commercial airline pilot. Under other circumstances, it may be more important to admit as many qualified persons as possible, at the risk of including more failures. In the latter case, the number of false rejects can be reduced by the choice of a lower cutoff score. Other factors that normally determine the position of the cutoff score include the available personnel supply, the number of job openings, and the urgency or speed with which the openings must be filled. In certain situations, attention should also be given to the percentage of false rejects (or false positives), as well as to the percentages of successes and failures within the selected group.
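The rough method just described, comparing the test-score distributions of the two criterion groups, can be sketched as a simple search for the cut that misclassifies the fewest cases. This is a simplified stand-in for the more precise published procedures, and the scores below are hypothetical.

```python
# Rough cutting-score search: place the cut where the two criterion
# groups' test scores are best separated, i.e., where misclassifications
# (successes below the cut plus failures at or above it) are fewest.
successes = [55, 58, 60, 62, 64, 67, 70, 73]   # test scores of criterion successes
failures = [41, 44, 48, 50, 53, 57, 61]        # test scores of criterion failures

def misclassified(cut):
    return sum(s < cut for s in successes) + sum(f >= cut for f in failures)

best_cut = min(range(40, 75), key=misclassified)
print(f"best cutting score: {best_cut} "
      f"({misclassified(best_cut)} misclassifications)")
```

A weighted version of the same search, counting a false rejection as more or less costly than a false acceptance, is one way the relative seriousness of the two errors can be taken into account.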

PREDICTION OF OUTCOMES. A precursor of decision theory in psychological testing is to be found in the Taylor-Russell tables (1939), which permit a determination of the net gain in selection accuracy attributable to the use of the test. The information required includes the validity coefficient of the test, the proportion of applicants who must be accepted (selection ratio), and the proportion of successful applicants selected without the use of the test (base rate). A change in any of these factors can alter the predictive efficiency of the test.

In applying the Taylor-Russell tables, the base rate refers to the proportion of successful employees prior to the introduction of the test for selection purposes. Obviously, if the selection ratio were 100 percent, that is, if all applicants had to be accepted, no test, however valid, could improve the selection process. Moreover, test validity should be computed on the same sort of group used to estimate the percentage of prior successes. If applicants had been selected on the basis of previous job history, letters of recommendation, and interviews, the contribution of the test should be evaluated on the basis of what the test adds to these previous selection procedures. In other words, the contribution of the test is not evaluated against chance success unless applicants were previously selected by chance, a most unlikely circumstance.

For purposes of illustration, one of the Taylor-Russell tables has been reproduced in Table 14. This table is designed for use when the base rate, or percentage of successful applicants selected prior to the use of the test, is .60. Other tables are provided by Taylor and Russell for other base rates. Across the top of the table are given different values of the selection ratio, and along the side are the test validities. The entries in the body of the table indicate the proportion of successful persons selected after the use of the test. Thus, the difference between .60 and any one table entry shows the increase in proportion of successful selections attributable to the test. It indicates the contribution the test makes to the selection of individuals who will meet the minimum standards in criterion performance.

TABLE 14. Proportion of "Successes" Expected through the Use of Test of Given Validity and Given Selection Ratio, for Base Rate .60. (From Taylor and Russell, 1939, p. 576.) [The body of the table, covering validities from .00 to 1.00 and selection ratios from .05 to .95, is too garbled in this copy to reproduce; the entries cited in the text are .82 for validity .30 with selection ratio .05, and .63 for validity 1.00 with selection ratio .95.]

Reference to Table 14 shows that, when as many as 95 percent of applicants must be admitted, even a test with perfect validity (r = 1.00) would raise the proportion of successful persons by only 3 percent (.60 to .63). On the other hand, when only 5 percent of applicants need to be chosen, a test with a validity coefficient of only .30 can raise the percentage of successful applicants selected from 60 to 82. The rise from 60 to 82 represents the incremental validity of the test (Sechrest, 1963), that is, the increase in predictive validity attributable to the test.

The incremental validity resulting from the use of a test depends not only on the selection ratio but also on the base rate. In the previously illustrated job selection situation, the base rate was .60; for other base rates, we need to consult the other appropriate tables in the cited reference (Taylor & Russell, 1939). Let us consider an example in which test validity and selection ratio are held constant while the base rate varies. What would be the contribution, or incremental validity, of the test if we begin with a base rate of 50 percent? And what would be the contribution if we begin with more extreme base rates of 10 and 90 percent? Reference to the appropriate Taylor-Russell tables shows that the percentage of successful employees would rise from 50 to 75 in the first case, from 10 to 21 in the second, and from 90 to 99 in the third. Thus, the improvement in percentage of successful employees attributable to the use of the test is 25 when the base rate was 50, but only 11 and 9 when the base rates were more extreme. The implications of extreme base rates are of special interest in clinical psychology, where the base rate refers to the frequency of the pathological condition to be diagnosed in the population tested (Buchwald, 1965; Cureton, 1957a; Meehl & Rosen, 1955; Wiggins, 1973).
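Under the bivariate-normal model that the Taylor-Russell tables assume, any entry can be regenerated numerically. The sketch below does this; it assumes the scipy library is available, and the parameter values are the ones cited above. Small discrepancies from the published entries reflect the numerical integration.

```python
from scipy.stats import norm, multivariate_normal

def taylor_russell(validity, selection_ratio, base_rate):
    """Proportion of successes among those selected, assuming a
    bivariate normal relation between test and criterion."""
    z_test = norm.ppf(1 - selection_ratio)   # test cutoff in z units
    z_crit = norm.ppf(1 - base_rate)         # criterion cutoff in z units
    # P(test > z_test and criterion > z_crit)
    both = (1 - norm.cdf(z_test) - norm.cdf(z_crit)
            + multivariate_normal.cdf([z_test, z_crit],
                                      mean=[0, 0],
                                      cov=[[1, validity], [validity, 1]]))
    return both / selection_ratio

print(round(taylor_russell(0.00, 0.05, 0.60), 2))  # 0.60: zero validity adds nothing
print(round(taylor_russell(0.30, 0.05, 0.60), 2))  # about .80, near the .82 entry cited
```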

With the extreme base rates found with rare pathological conditions, the improvement attributable to a test may be negligible. If 5 percent of the intake population of a clinic has organic brain damage, for example, then 5 percent is the base rate of brain damage in this population. Although the introduction of any valid test will improve predictive or diagnostic accuracy, the improvement is greatest when the base rates are closest to 50 percent. With very extreme base rates, the use of a test may prove to be unjustified when the cost of its administration and scoring is taken into account. In a clinical situation, this cost would include the time of professional personnel that might otherwise be spent on the treatment of additional cases (Buchwald, 1965). The number of false positives, or normal individuals incorrectly classified as pathological, would of course increase this overall cost.

When the seriousness of a rare condition makes its diagnosis urgent, however, tests of moderate validity may be employed in an early stage of sequential decisions. For example, all cases might first be screened with an easily administered test of moderate validity. If the cutoff score is set high enough (high scores being favorable), there will be few false negatives but many false positives, or normals diagnosed as pathological. The latter can then be detected through a more intensive individual examination given to all cases diagnosed as positive by the test. This solution would be appropriate when available facilities make the intensive individual examination of all cases impracticable.

RELATION OF VALIDITY TO MEAN OUTPUT LEVEL. In many practical situations, what is wanted is an estimate of the effect of the selection test, not on the percentage of persons exceeding the minimum performance, but on the overall output of the selected persons. How does the actual level of job proficiency or criterion achievement of the workers hired on the basis of the test compare with that of the total applicant sample that would have been hired without the test? Following the work of Taylor and Russell, several investigators addressed themselves to this question (Brogden, 1946; Brown & Ghiselli, 1953; Jarrett, 1948; Richardson, 1944). Brogden (1946) first demonstrated that the expected increase in output is directly proportional to the validity of the test: the improvement resulting from the use of a test of validity .50, for example, is 50 percent as great as the improvement expected from a test of perfect validity.
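Brogden's proportionality result can be verified numerically. Under the normal model, the mean standard criterion score of the selected group equals the validity coefficient multiplied by the mean standard test score of those selected. The sketch below is a minimal illustration, using the 20 percent selection ratio discussed in the next paragraphs, and it reproduces the values quoted there from Table 15.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def mean_criterion_of_selected(validity, selection_ratio):
    """Expected standard (z) criterion score of the selected group.
    The mean test z of those above the cutoff is phi(z_cut)/SR, and the
    criterion gain is validity times that value (Brogden, 1946)."""
    z_cut = NormalDist().inv_cdf(1 - selection_ratio)
    ordinate = exp(-z_cut ** 2 / 2) / sqrt(2 * pi)   # normal density at cutoff
    return validity * ordinate / selection_ratio

for r in (0.25, 0.50, 1.00):
    print(f"validity {r:.2f}: mean criterion z of those hired = "
          f"{mean_criterion_of_selected(r, 0.20):.2f}")   # .35, .70, 1.40
```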

Expressing criterion scores as standard scores with a mean of zero and an SD of 1.00, Table 15 gives the expected mean criterion score of workers selected with a test of given validity and with a given selection ratio. (A table including more values for both selection ratios and validity coefficients was prepared by Naylor and Shine, 1965.) To illustrate the use of the table, let us assume that the highest scoring 20 percent of the applicants are hired (selection ratio = .20) by means of a test whose validity coefficient is .50. Reference to Table 15 shows that the mean criterion performance of this group is .70 SD above the expected base mean of an untested sample. With the same 20 percent selection ratio and a perfect test (validity coefficient = 1.00), the mean criterion score of the accepted applicants would be 1.40, just twice what it would be with the test of validity .50. Similarly, a validity of .25 yields a mean criterion score of .35, while a validity of .50 yields a mean of .70: doubling the validity doubles the output rise. Similar direct linear relations will be found if other mean criterion performances are compared within any row of Table 15. Using a test with zero validity is equivalent to using no test at all; the base output, corresponding to the performance of applicants selected without use of the test, is given in the column for zero validity.

[TABLE 15, giving the expected mean standard criterion score of accepted cases for different combinations of test validity and selection ratio, is not legible in this copy.]

THE ROLE OF VALUES IN DECISION THEORY. It is characteristic of decision theory that tests are evaluated in terms of their effectiveness in a specific situation. Such evaluation takes into account not only the validity of the test in predicting a particular criterion but also a number of other parameters, including base rate and selection ratio. Another important parameter is the relative utility of expected outcomes, the judged favorableness or unfavorableness of each outcome. The evaluation of test validity in terms of either mean predicted output or the proportion of persons exceeding a minimum criterion cutoff is obviously much more favorable than an evaluation based on the previously discussed error of estimate. The reason for the difference is that prediction errors that do not affect decisions are irrelevant to the selection situation. For example, if Smith and Jones are both superior workers and are both hired on the basis of the test, it does not matter if the test shows Smith to be better than Jones while in job performance Jones excels Smith.

It has been repeatedly pointed out that decision theory did not introduce the problem of values into the decision process, but merely made it explicit. Values have always entered into decisions, although they were not heretofore clearly recognized or systematically handled. In choosing a decision strategy, the goal is to maximize expected utilities across all outcomes. Reference to the schematic representation of a simple decision strategy in Figure 18 will help to clarify the procedure. This diagram shows the decision strategy illustrated in Figure 17, in which a single test is administered to a group of applicants and the decision to accept or reject an applicant is made on the basis of a cutoff score on the test. There are four possible outcomes: valid and false acceptances, and valid and false rejections. The probability of each outcome can be found from the number of persons in each of the four sections of Figure 17; since there were 100 applicants in that example, these numbers divided by 100 give the probabilities of the four outcomes.

FIG. 18. A Simple Decision Strategy. [Flow diagram: administer test and apply cutoff score; decide to accept or reject. Outcome probabilities from Figure 17: valid acceptance .38, false acceptance .07, valid rejection .33, false rejection .22.]

The other data needed are the utilities of the different outcomes, expressed on a common scale. The expected overall utility of the strategy could then be found by multiplying the probability of each outcome by the utility of that outcome, adding these products, and subtracting a value corresponding to the cost of administering the test. (For a fuller discussion of the implications of decision theory for test use, see Wiggins, 1973, pp. 257-274.) In this connection, a test of low validity might be retained if it is short, inexpensive, and suitable for group administration by relatively untrained personnel, whereas an individual test requiring a trained examiner or expensive equipment would need a higher validity to justify its use.

The lack of adequate systems for assigning values to outcomes in terms of a uniform utility scale is one of the chief obstacles to the application of decision theory. In industrial decisions, a dollar-and-cents value can frequently be assigned to different outcomes. Even in such cases, however, certain outcomes pertaining to good will, public relations, and employee morale are difficult to assess in monetary terms. Educational decisions must take into account institutional goals, social values, and other relatively intangible factors. Individual decisions, as in counseling, must consider the individual's preferences and value system.
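The expected-utility computation just described is a weighted sum. In the sketch below, the outcome probabilities are those of Figure 18; the utility values and the testing cost are hypothetical placeholders, since the text leaves them unspecified.

```python
# Expected overall utility of a decision strategy: sum of
# probability * utility over the four outcomes, minus testing cost.
# Probabilities follow Figure 18; utilities and cost are hypothetical,
# expressed on an arbitrary common scale.
outcomes = {
    "valid acceptance": (0.38, 1.0),
    "false acceptance": (0.07, -2.0),
    "valid rejection":  (0.33, 0.5),
    "false rejection":  (0.22, -1.0),
}
cost_of_testing = 0.05

expected_utility = sum(p * u for p, u in outcomes.values()) - cost_of_testing
print(f"expected utility of the strategy: {expected_utility:+.3f}")
```

Changing the utility assigned to a false acceptance relative to a false rejection shifts which cutoff, and hence which strategy, maximizes this sum; that is exactly the sense in which values enter the decision.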

SEQUENTIAL STRATEGIES AND ADAPTIVE TREATMENTS. In some situations, the effectiveness of a test may be increased through the use of more complex decision strategies, which take still more parameters into account. Two examples will serve to illustrate these possibilities. First, a test may be used to make sequential rather than terminal decisions. With the simple decision strategy illustrated in Figures 17 and 18, all decisions to accept or reject are treated as terminal. By contrast, Figure 19 shows a two-stage sequential decision strategy. Test A could be a short and easily administered preliminary screening test. On the basis of performance on this test, individuals would be sorted into three categories, including those clearly accepted or rejected, as well as an intermediate "uncertain" group to be examined further with more intensive techniques, represented by Test B. On the basis of the second-stage testing, this group would in turn be sorted into accepted and rejected categories. This is the strategy cited earlier in this chapter in connection with the use of tests to diagnose pathological conditions with very low base rates: all cases classified as positive (i.e., possibly pathological) by an easily administered screening test were to be followed up with a more intensive individual examination.

It should also be noted that many personnel decisions are in effect sequential, although they may not be so perceived. Incompetent employees hired because of prediction errors can usually be discharged after a probationary period, and failing students can be dropped from college at several stages. In such situations, it is only adverse selection decisions that are terminal. Incorrect selection decisions that are later rectified may be costly in terms of several value systems, but they are often less costly than terminal wrong decisions.

Sequential testing can also be employed within a single testing session. In this case, the sequence of items or item groups within the test is determined by the examinee's own performance. For example, everyone might begin with a set of items of intermediate difficulty. Those who score poorly are routed to easier items; those who score well, to more difficult items. The principal effect is that each examinee attempts only those items suited to his ability level, rather than trying all items. Such branching may occur repeatedly, at several stages. Although in principle applicable to paper-and-pencil printed group tests, sequential testing is particularly well suited for computer testing, with its flexibility of approach (DeWitt & Weiss, 1974; Linn, Rock, & Cleary, 1969; Weiss & Betz, 1973). Sequential testing models will be discussed further in Chapter 11, in connection with the utilization of computers in group testing.

A second condition that may alter the effectiveness of a psychological test is the availability of alternative treatments and the possibility of adapting treatments to individual characteristics. Examples would be the utilization of different training procedures for workers at different aptitude levels, or the introduction of compensatory educational programs for students with certain educational disabilities. When adaptive treatments are utilized, the success rate is likely to be substantially improved. Under these conditions, the decision strategy followed in individual cases should take into account available data on the interaction of initial test score and differential treatment. Because the assignment of individuals to alternative treatments is essentially a classification rather than a selection problem, more will be said about the required methodology in a later section on classification decisions.

The examples cited illustrate a few of the ways in which the concepts and rationale of decision theory can assist in the evaluation of psychological tests for specific testing purposes. Essentially, decision theory has served to focus attention on the complexity of factors that determine the contribution a given test can make in a particular situation. The validity coefficient alone cannot indicate whether or not a test should be used, since it is only one of the factors to be considered in evaluating the impact of the test on the efficacy of the total decision process.
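Returning to the two-stage strategy of Figure 19, the routing rule can be stated compactly. All cutoff values in the sketch below are hypothetical, since the text gives none.

```python
def two_stage_decision(score_a, take_test_b):
    """Two-stage sequential strategy: screening Test A with two cutoffs;
    only the intermediate 'uncertain' group takes Test B.
    All cutoff values are hypothetical."""
    ACCEPT_A, REJECT_A, CUT_B = 70, 40, 55
    if score_a >= ACCEPT_A:
        return "accept"            # terminal decision at stage 1
    if score_a < REJECT_A:
        return "reject"            # terminal decision at stage 1
    score_b = take_test_b()        # intensive second-stage examination
    return "accept" if score_b >= CUT_B else "reject"

# A high scorer is decided at stage 1; a middling scorer is routed to Test B.
print(two_stage_decision(85, lambda: 0))    # accept, Test B never given
print(two_stage_decision(55, lambda: 60))   # uncertain -> Test B -> accept
```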

DIFFERENTIALLY PREDICTABLE SUBSETS OF PERSONS. The classic psychometric model assumes that prediction errors are characteristic of the test rather than of the person, and that these errors are randomly distributed among persons. In recent years there has been increasing exploration of prediction models involving interaction between persons and tests. Such interaction implies that the same test may be a better predictor for certain classes or subsets of persons than it is for others. For example, a given test may be a better predictor of criterion performance for men than for women, or a better predictor for applicants from a lower than for applicants from a higher socioeconomic level. In these examples, sex and socioeconomic level are known as moderator variables, since they moderate the validity of the test (Saunders, 1956). A moderator variable is thus some characteristic of persons that makes it possible to predict the predictability of different individuals with a given instrument. It may be a demographic variable, such as sex, age, or socioeconomic background, or it may be a score on another test. When computed in a total group, the validity coefficient of a test may be too low to be of much practical value in prediction. But when recomputed in subsets of individuals differing in some identifiable characteristic, validity may be high in one subset and negligible in another. The test could then be used effectively in making decisions regarding persons in the first subset but not in the second; for the latter, perhaps another test or some other assessment device could be found that is an effective predictor.

Interests and motivation often function as moderator variables. Thus, if an applicant has little interest in a job, he will probably perform poorly regardless of his scores on relevant aptitude tests. Among such persons, the correlation between aptitude test scores and job performance would be low. For individuals who are interested and highly motivated, on the other hand, the correlation between aptitude test score and job success may be quite high.

In a different context, there is evidence that self-report personality inventories may have higher validity for some types of neurotics than for others (Fulkerson, 1959). The individual who is characteristically precise and careful about details, who tends to worry about his problems, and who uses intellectualization as a primary defense is likely to provide a more accurate picture of his emotional difficulties on a self-report inventory than is the impulsive, careless individual who tends to avoid expressing unpleasant thoughts and emotions and who uses denial as a primary defense. The characteristic behavior of the two types tends to make one type careful and accurate in reporting symptoms, the other careless and evasive.

EMPIRICAL EXAMPLES OF MODERATOR VARIABLES. Evidence for the operation of moderator variables comes from a variety of sources. In a survey of several hundred correlation coefficients between aptitude test scores and academic grades, Seashore (1962) found higher correlations for women than for men in the large majority of instances. The same trend was found in high school and in college, although it was more pronounced at the college level. The data do not indicate the reason for this sex difference in the predictability of academic achievement, but it may be interesting to speculate about it in the light of other known sex differences. If women students in general tend to be more conforming and more inclined to accept the values and standards of the school situation, their class achievement will probably depend largely on their abilities. If, on the other hand, men students tend to concentrate their efforts on those activities (in or out of school) that arouse their individual interests, these interest differences would introduce additional variance in their course achievement and would make it more difficult to predict achievement from test scores. Whatever the reason for the difference, sex does appear to function as a moderator variable in the predictability of academic grades from aptitude test scores.

A number of investigations have been specially designed to assess the role of moderator variables in the prediction of academic achievement. Several studies (Frederiksen & Gilbert, 1960; Frederiksen & Melville, 1954) tested the hypothesis that the more compulsive students would put a great deal of effort into their course work, regardless of their interest in the courses, while the effort of the less compulsive students would depend on their interest. Since effort will be reflected in grades, the correlation between the appropriate interest test scores and grades should be higher among noncompulsive than among compulsive students. This hypothesis was confirmed in several groups of male engineering students, identified through two tests of compulsivity, but not among liberal arts students of either sex. Moreover, the lack of agreement among different indicators of compulsivity casts doubt on the generality of the construct that was being measured (Stricker, 1966). In another study (Grooms & Endler, 1960), the college grades of the more anxious students correlated higher (r = .63) with aptitude and achievement test scores than did the grades of the less anxious students (r = .19).

A different approach is illustrated by Berdie (1961), who investigated the relation between intraindividual variability on a test and the predictive validity of the same test. It was hypothesized that a given test will be a better predictor for those individuals who perform more consistently in different parts of the test, and whose total scores are thus more reliable. Although the hypothesis was partially confirmed, the relation proved to be more complex than anticipated (Berdie, 1961, 1968).

Ghiselli (1956, 1960b, 1967) has extensively explored the role of moderator variables in industrial situations. In a study of taxi drivers (Ghiselli, 1956), the correlation between an aptitude test and a job-performance criterion in the total applicant sample was only .220. The group was then sorted into thirds on the basis of scores on an occupational interest test. When the validity of the aptitude test was recomputed within the third whose occupational interest level was most appropriate for the job, it rose to .664.
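Numerically, a moderator variable shows up as a difference between subgroup validity coefficients, as in the taxi-driver example above. The following sketch computes the test-criterion correlation separately within two interest subgroups; the records are fictitious, and the pattern rather than the particular values is the point.

```python
from statistics import correlation  # Python 3.10+

# Fictitious validation records: (aptitude score, job rating, high interest?)
records = [
    (60, 72, True), (55, 66, True), (70, 81, True), (48, 55, True),
    (65, 74, True), (52, 60, True),
    (58, 51, False), (62, 70, False), (49, 64, False), (67, 58, False),
    (53, 69, False), (60, 57, False),
]

for label, flag in (("high-interest", True), ("low-interest", False)):
    xs = [apt for apt, crit, grp in records if grp == flag]
    ys = [crit for apt, crit, grp in records if grp == flag]
    print(f"{label} subgroup validity: r = {correlation(xs, ys):+.2f}")
```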

A technique employed by Ghiselli in much of his research consists in finding for each individual the absolute difference (D) between his actual and his predicted criterion scores. The smaller the value of D, the more predictable is the individual's criterion score. A predictability scale is then developed by comparing the item responses of two contrasted subgroups selected on the basis of their D scores. The predictability scale is subsequently applied to a new sample, to identify highly predictable and poorly predictable subgroups, and the validity of the original test is compared in these two subgroups. This approach has shown considerable promise as a means of identifying persons for whom a test will be a good or a poor predictor. An extension of the same procedure has been developed to determine in advance which of two tests will be a better predictor for each individual (Ghiselli, 1960a).

Other investigators (Dunnette, 1972; Hobert & Dunnette, 1967) have argued that Ghiselli's D index, based on the absolute amount of prediction error without regard to the direction of error, may obscure important individual differences. Alternative procedures, involving separate analyses of overpredicted and underpredicted cases, have accordingly been proposed. At this time, the identification and use of moderator variables are still in an exploratory phase. The results are usually quite specific to the situations in which they were obtained, and considerable caution is required to avoid methodological pitfalls (see, e.g., Abrahams & Alf, 1972; Dunnette, 1972; Velicer, 1972a, 1972b). It is also important to check the extent to which the use of moderators actually improves the prediction that could be achieved through other, more direct means (Pinder, 1973).

For the prediction of practical criteria, not one but several tests are generally required. Most criteria are complex, the criterion measure depending on a number of different traits. A single test designed to measure such a criterion would thus have to be highly heterogeneous. It has already been pointed out, however, that a relatively homogeneous test, measuring largely a single trait, yields less ambiguous scores. Hence, it is usually preferable to use a combination of several relatively homogeneous tests, each covering a different aspect of the criterion, rather than a single test consisting of a hodgepodge of many different kinds of items. When a number of specially selected tests are employed together to predict a single criterion, they are known as a test battery. The chief problem arising in the use of such batteries concerns the way in which scores on the different tests are to be combined in arriving at a decision regarding each individual. The statistical procedures followed for this purpose are of two major types, namely, multiple regression equations and multiple cutoff scores. When tests are administered in the intensive study of individual cases, on the other hand, as in clinical diagnosis, counseling, or the evaluation of high-level executives, it is a common practice for the examiner to utilize test scores without further statistical analysis. In preparing a case report and in making recommendations, the examiner relies on judgment, past experience, and theoretical rationale to interpret score patterns and integrate findings from different tests. Such clinical use of test scores will be discussed further in Chapter 16.

MULTIPLE REGRESSION EQUATION. The multiple regression equation yields a predicted criterion score for each individual on the basis of his scores on all the tests in the battery. The following regression equation illustrates the application of this technique to predicting a student's achievement in high school mathematics courses from his scores on verbal (V), numerical (N), and reasoning (R) tests:

Mathematics Achievement = .21 V + .21 N + .32 R + 1.35

In this equation, the student's stanine score on each of the three tests is multiplied by the corresponding weight given in the equation. The sum of these products, plus a constant (1.35), gives the student's predicted stanine position in mathematics courses. Suppose that Bill Jones receives the following stanine scores: Verbal 6, Numerical 4, Reasoning 8. The estimated mathematics achievement of this student is found as follows:

Mathematics Achievement = (.21)(6) + (.21)(4) + (.32)(8) + 1.35 = 6.01

Bill's predicted stanine is thus approximately 6. It will be recalled (Ch. 4) that a stanine of 5 represents average performance. Bill would therefore be expected to do somewhat better than average in mathematics courses. His very superior performance on the reasoning test (R = 8) and his above-average score on the verbal test (V = 6) compensate for his poor score in speed and accuracy of computation (N = 4). Specific techniques for the computation of regression equations can be found in standard texts on psychological statistics.
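The worked example for Bill Jones can be verified in a line or two of code; the weights and scores are exactly those given above.

```python
def predicted_math_stanine(verbal, numerical, reasoning):
    """Regression equation from the text:
    Mathematics Achievement = .21 V + .21 N + .32 R + 1.35 (stanine units)."""
    return 0.21 * verbal + 0.21 * numerical + 0.32 * reasoning + 1.35

# Bill Jones: Verbal 6, Numerical 4, Reasoning 8
print(f"{predicted_math_stanine(6, 4, 8):.2f}")   # 6.01 -> approximately stanine 6
```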
