Psychological Testing and Psychological Assessment: A Review of Evidence and Issues

Gregory J. Meyer, University of Alaska Anchorage
Stephen E. Finn, Center for Therapeutic Assessment
Lorraine D. Eyde, U.S. Office of Personnel Management
Gary G. Kay, Georgetown University Medical Center
Kevin L. Moreland, Fort Walton Beach, FL
Robert R. Dies, New Port Richey, FL
Elena J. Eisman, Massachusetts Psychological Association
Tom W. Kubiszyn and Geoffrey M. Reed, American Psychological Association

This article summarizes evidence and issues associated with psychological assessment. Data from more than 125 meta-analyses on test validity and 800 samples examining multimethod assessment suggest 4 general conclusions: (a) Psychological test validity is strong and compelling, (b) psychological test validity is comparable to medical test validity, (c) distinct assessment methods provide unique sources of information, and (d) clinicians who rely exclusively on interviews are prone to incomplete understandings. Following principles for optimal nomothetic research, the authors suggest that a multimethod assessment battery provides a structured means for skilled clinicians to maximize the validity of individualized assessments. Future investigations should move beyond an examination of test scales to focus more on the role of psychologists who use tests as helpful tools to furnish patients and referral sources with professional consultation.

For clinical psychologists, assessment is second only to psychotherapy in terms of its professional importance (Greenberg, Smith, & Muenzen, 1995; Norcross, Karg, & Prochaska, 1997; Phelps, Eisman, & Kohout, 1998). However, unlike psychotherapy, formal assessment is a distinctive and unique aspect of psychological practice relative to the activities performed by other health care providers.
Unfortunately, with dramatic health care changes over the past decade, the utility of psychological assessment has been increasingly challenged (Eisman et al., 1998, 2000), and there has been declining use of the time-intensive, clinician-administered instruments that have historically defined professional practice (Piotrowski, 1999; Piotrowski, Belter, & Keller, 1998).

In response, the American Psychological Association's (APA) Board of Professional Affairs (BPA) established a Psychological Assessment Work Group (PAWG) in 1996 and commissioned it (a) to evaluate contemporary threats to psychological and neuropsychological assessment services and (b) to assemble evidence on the efficacy of assessment in clinical practice. The PAWG's findings and recommendations were released in two reports to the BPA (Eisman et al., 1998; Meyer et al., 1998; also see Eisman et al., 2000; Kubiszyn et al., 2000). This article extends Meyer et al. (1998) by providing a large and systematic summary of evidence on testing and assessment.

Gregory J. Meyer, Department of Psychology, University of Alaska Anchorage; Stephen E. Finn, Center for Therapeutic Assessment, Austin, TX; Lorraine D. Eyde, U.S. Office of Personnel Management, Washington, DC; Gary G. Kay, Georgetown University Medical Center; Kevin L. Moreland, independent practice, Fort Walton Beach, FL; Robert R. Dies, independent practice, New Port Richey, FL; Elena J. Eisman, Massachusetts Psychological Association, Boston, MA; Tom W. Kubiszyn and Geoffrey M. Reed, Practice Directorate, American Psychological Association, Washington, DC.
Tom W. Kubiszyn is now at the Department of Educational Psychology, University of Texas at Austin.
Kevin L. Moreland passed away in 1999.
We thank the Society for Personality Assessment for supporting Gregory J. Meyer's organization of the literature summarized in this article.
Correspondence concerning this article should be addressed to Gregory J. Meyer, Department of Psychology, University of Alaska Anchorage, 3211 Providence Drive, Anchorage, AK 99508. Electronic mail may be sent to afgjm@uaa.alaska.edu.

The PAWG reports can be obtained free of charge from Christopher J. McLaughlin, Assistant Director, Practice Directorate, American Psychological Association, 750 First Street NE, Washington, DC 20002-4242; e-mail: cmclaughlin@apa.org. Because of space limitations, this article does not cover some important issues detailed in Meyer et al. (1998).

128 February 2001 • American Psychologist

Our goals are sixfold. First, we briefly describe the purposes and appropriate applications of psychological assessment. Second, we provide a broad overview of testing and assessment validity. Although we present a great deal of data, by necessity, we paint in broad strokes and rely heavily on evidence gathered through meta-analytic reviews. Third, to help readers understand the strength of the assessment evidence, we highlight findings in two comparative contexts. To ensure a general understanding of what constitutes a small or large correlation (our effect size measure), we review a variety of nontest correlations culled from psychology, medicine, and everyday life. Next, to more specifically appreciate the test findings, we consider psychological test validity alongside medical test validity. On the basis of these data, we conclude that there is substantial evidence to support psychological testing and assessment. Fourth, we describe features that make testing a valuable source of clinical information and present an extensive overview of evidence that documents how distinct methods of assessment provide unique perspectives. We use the latter to illustrate the clinical value of a multimethod test battery and to highlight the limitations that emerge when using an interview as the sole basis for understanding patients.
Fifth, we discuss the distinction between testing and assessment and highlight vital issues that are often overlooked in the research literature. Finally, we identify productive avenues for future research.

The Purposes and Appropriate Uses of Psychological Assessment

Some of the primary purposes of assessment are to (a) describe current functioning, including cognitive abilities, severity of disturbance, and capacity for independent living; (b) confirm, refute, or modify the impressions formed by clinicians through their less structured interactions with patients; (c) identify therapeutic needs, highlight issues likely to emerge in treatment, recommend forms of intervention, and offer guidance about likely outcomes; (d) aid in the differential diagnosis of emotional, behavioral, and cognitive disorders; (e) monitor treatment over time to evaluate the success of interventions or to identify new issues that may require attention as original concerns are resolved; (f) manage risk, including minimization of potential legal liabilities and identification of untoward treatment reactions; and (g) provide skilled, empathic assessment feedback as a therapeutic intervention in itself.

APA ethical principles dictate that psychologists provide services that are in the best interests of their patients (American Psychological Association, 1992). Thus, all assessors should be able to furnish a sound rationale for their work and explain the expected benefits of an assessment, as well as the anticipated costs. Although it is valuable to understand the benefits of a test relative to its general costs, it is important to realize how cost-benefit ratios ultimately can be determined only for individual patients when working in a clinical context (Cronbach & Gleser, 1965; Finn, 1982).
Tests expected to have more benefits than costs for one patient may have different or even reversed cost-benefit ratios for another. For instance, memory tests may have an excellent cost-benefit ratio for an elderly patient with memory complaints but a decidedly unfavorable ratio for a young adult for whom there is no reason to suspect memory problems. This implies that general bureaucratic rules about appropriate test protocols are highly suspect. A test that is too long or costly for general use may be essential for clarifying the clinical picture with particular patients. In addition, certain assessment practices that may have been common in some settings can now be seen as questionable, including (a) mandated testing of patients on a fixed schedule regardless of whether the repeat assessment is clinically indicated, (b) administrative guidelines specifying that all patients or no patients are to receive psychological evaluations, and (c) habitual testing of all patients using large fixed batteries (Griffith, 1997; Meier, 1994).

Finally, although specific rules cannot be developed, provisional guidelines for when assessments are likely to have the greatest utility in general clinical practice can be offered (Finn & Tonsager, 1997; Haynes, Leisen, & Blaine, 1997). In pretreatment evaluation, when the goal is to describe current functioning, confirm or refute clinical impressions, identify treatment needs, suggest appropriate interventions, or aid in differential diagnosis, assessment is likely to yield the greatest overall utility when (a) the treating clinician or patient has salient questions, (b) there are a variety of treatment approaches from which to choose and a body of knowledge linking treatment methods to patient characteristics, (c) the patient has had little success in prior treatment, or (d) the patient has complex problems and treatment goals must be prioritized.
The therapeutic impact of assessment on patients and their interpersonal systems (i.e., family, teachers, and involved health service providers) is likely to be greatest when (a) initial treatment efforts have failed, (b) patients are curious about themselves and motivated to participate, (c) collaborative procedures are used to engage the patient, (d) family and allied health service providers are invited to furnish input, and (e) patients and relevant others are given detailed feedback about results.

Identifying several circumstances when assessments are likely to be particularly useful does not mean that assessments under other circumstances are questionable. Rather, the key that determines when assessment is appropriate is the rationale for using specific instruments with a particular patient under a unique set of circumstances to address a distinctive set of referral questions. An assessment should not be performed if this information cannot be offered to patients, referring clinicians, and third-party payers.

A Foundation for Understanding Testing and Assessment Validity Evidence

To summarize the validity literature on psychological testing and assessment, we use the correlation coefficient as our effect size index. In this context, the effect size quantifies the strength of association between a predictor test scale and a relevant criterion variable. To judge whether the test validity findings are poor, moderate, or substantial, it helps to be clear on the circumstances when one is likely to see a correlation of .10, .20, .30, and so on. Therefore, before delving into the literature on testing and assessment, (text continues on page 132)

Different issues are likely to come to the forefront during forensic evaluations, although they are not considered here.

Table 1
Examples of the Strength of Relationship Between Two Variables in Terms of the Correlation Coefficient (r)

Predictor and criterion, r, N (study and notes)
1. Effect of sugar consumption on the behavior and cognitive processes of children, r = .00, N = 560 (Wolraich, Wilson, & White, 1995; the sample-size weighted effect across the measurement categories reported in their Table 2 was r = .01. However, none of the individual outcomes produced effect sizes that were significantly different from zero. Thus, r = 0.0 is reported as the most accurate estimate of the true effect).
2. Aspirin and reduced risk of death by heart attack, r = .02, N = 22,071 (Steering Committee of the Physicians' Health Study Research Group, 1988).
3. Antihypertensive medication and reduced risk of stroke, r = .03, N = 59,086 (Psaty et al., 1997; the effect of treatment was actually smaller for all other disease end points studied [i.e., coronary heart disease, congestive heart failure, cardiovascular mortality, and total mortality]).
4. Chemotherapy and surviving breast cancer, r = .03, N = 9,069 (Early Breast Cancer Trialists' Collaborative Group, 1988).
5. Post-MI cardiac rehabilitation and reduced death from cardiovascular complications, r = .04, N = 4,044 (Oldridge, Guyatt, Fischer, & Rimm, 1988; weighted effect calculated from data in their Table 3. Cardiac rehabilitation was not effective in reducing the risk for a second nonfatal MI [r = -.03; effect in ...

... when pairwise differences were examined with post hoc Scheffé tests. The unweighted mean rs were as follows: self-report personality tests = .24 (SD = .18, n = 24), performance personality tests (i.e., Rorschach, apperceptive storytelling tasks, sentence completion) = .33 (SD = .08), cognitive or neuropsychological tests = .34 (SD = .19, n = 26), other psychological tests (e.g., observer ratings) = .30 (SD = .08), and medical tests = .36 (SD = .21, n = 63).

Table 2
Examples of Testing and Assessment Validity Coefficients With an Emphasis on Meta-Analytic Results

Predictor and criterion, r, N (study and notes)
1. Dexamethasone suppression test scores and response to depression treatment (Ribeiro, Tandon, Grunhaus, & Greden, 1993).
2. Fecal occult blood test screening and reduced death from colorectal cancer (Towler et al., 1998).
3. Routine umbilical artery Doppler ultrasound and reduced perinatal deaths in low-risk women (Goffinet, Paris-Llado, Nisand, & Bréart, 1997; the authors also examined the impact of routine umbilical artery ultrasound on 13 other measures of successful outcome. The average effect size across these other criteria was r = -.0036 [ns from 6,373 to 11,375], with the largest correlation in the expected direction being .0097 [for Apgar scores at 5 minutes]).
4. Routine ultrasound examinations and successful pregnancy outcomes (Bucher & Schmidt, 1993; outcomes considered were live births [r = .0009], no induced labor [r = .0176], no low Apgar scores [r = -.0067], no miscarriages [r = .0054], and no perinatal mortality [r = .0168]).
5. MMPI Ego Strength scores and subsequent psychotherapy outcome (Meyer & Handler, 1997; this meta-analysis considered only studies in which the Ego Strength scale was used along with the Rorschach PRS).
6. Routine umbilical artery Doppler ultrasound and reduced perinatal deaths in high-risk women (Alfirevic & Neilson, 1995; the authors also examined the impact of routine umbilical artery ultrasound on 19 other measures of successful outcome. The average effect size across these other criteria was r = .018).
7. Denial/repressive coping style and development of breast cancer (McKenna, Zevon, Corn, & Rounds, 1999; weighted effect size computed from the study data in their table).
8. "Triple marker" prenatal screening of maternal serum and identification of trisomy 18 (Yankowitz, Fulton, Williamson, Grant, & Budelier, 1998).
9. Impact of geriatric medical assessment teams on reduced deaths (data combined from the meta-analysis by Rubenstein, Stuck, Siu, & Wieland, 1991, and the following more recent studies: Boult et al., 1994; Büla et al., 1999; Burns, Nichols, Graney, & Cloar, 1995; Engelhardt et al.,
1996; Fabacher et al., 1994; Fretwell et al., 1990; Germain, Knoeffel, Wieland, & Rubenstein, 1995; Hansen, Poulsen, & Sorensen, 1995; Harris et al., 1991; Karppi & Tilvis, 1995; Naughton, Moran, Feinglass, Falconer, & Williams, 1994; Reuben et al., 1995; Rubenstein, Josephson, Harker, Miller, & Wieland, 1995; Rubin, Sizemore, Loftis, & de Mola, 1993; Silverman et al., 1995; Siu et al., 1996; Thomas, Brahan, & Haywood, 1993; and Trentini et al., 1995; only the latest available outcome data were used for each sample).
10. MMPI depression profile scores and subsequent cancer within 20 years (Persky, Kempthorne-Rawson, & Shekelle, 1987).
11. Ventilatory lung function test scores and subsequent lung cancer within 25 years (Islam & Schottenfeld, 1994).
12. Rorschach Interaction Scale scores and subsequent cancer within 30 years (Graves, Phil, Mead, & Pearson, 1986; scores remained significant predictors after controlling for baseline smoking, serum cholesterol, systolic blood pressure, weight, and age).
13. Unique contribution of an MMPI high-point code (vs. other codes) to conceptually relevant criteria (McGrath & Ingersoll, 1999a, 1999b).
14. MMPI scores and subsequent prison misconduct (Gendreau, Goggin, & Law, 1997).
15. Beck Hopelessness Scale scores and subsequent suicide (data combined from Beck, Brown, Berchick, Stewart, & Steer, 1990, and Beck, Steer, Kovacs, & Garrison, 1985).
16. MMPI elevations on Scales F, 6, or 8 and criminal defendant incompetency (Nicholson & Kugler, 1991).
17. Extraversion test scores and success in sales (concurrent
and predictive; data combined from Barrick & Mount, 1991, Table 2; Salgado, 1997, Table 3; and Vinchur, Schippmann, Switzer, & Roth, 1998 [coefficients from their Tables 2 and 3 were averaged, and the largest N was used for the overall sample size]).
18. Attention and concentration test scores and residual mild head trauma (Binder, Rohling, & Larrabee, 1997).
19. In cervical cancer, lack of glandular differentiation on tissue biopsy and survival past 5 years (Heatley, 1999; this study reported two meta-analyses. The other one found that nuclear DNA content was of no value for predicting cancer progression in initially low-grade cervical intraepithelial neoplasia).

r (Entries 1-19, in order; several values missing): .00, .01, .01, .0[?], .02, .03, .04, .07, .07, .08, .08, .08, .09
N (Entries 1-19, in order): 2,068; 329,642; 11,375; 16,207; 280; 7,474; 12,908; 40,748; 10,065; 2,018; 3,956; 1,027; 8,614; 17,636; 2,123; 1,461; 6,004; 622; 685

20. Negative emotionality test scores and subsequent heart disease, r = .11, k = 11 (Booth-Kewley & Friedman, 1987; data were derived from their Table 7, with negative emotionality defined by the weighted effect for anger/hostility/aggression, depression, and anxiety).
21. "Triple marker" prenatal screening of maternal serum and identification of Down's syndrome, r = .11, N = 194,326 (Conde-Agudelo & Kafury-Goeta, 1998; results were reported across all ages).
22. General cognitive ability and involvement in automobile accidents, r = .12, N = 1,020 (Arthur, Barrett, & Alexander, 1991).
23. Conscientiousness test scores and job proficiency, r = .12, N = 21,650 (concurrent and predictive; data combined from Barrick & Mount, 1991, Table 3; Mount, Barrick, & Stewart, 1998; Salgado, 1998, Table 1; and Vinchur et al., 1998 [coefficients from their Tables 2 and 3 were averaged, and the largest N was used for the overall sample size]).
24. Platform posturography and detection of balance deficits due to vestibular impairment, r = .13, N = 1,477 (Di Fabio, 1996).
25. General intelligence and success in military pilot training, r = .13, N = 15,403 (Martinussen, 1996).
26. Self-report scores of achievement motivation and spontaneous achievement
behavior, r = .15, k = 104 (Spangler, 1992; coefficient derived from the weighted average of the semioperant and operant criterion data reported in Spangler's Table 2).
27. Graduate Record Exam Verbal or Quantitative scores and subsequent graduate GPA in psychology, r = .15, N = 963 (E. L. Goldberg & Alliger, 1992).
28. Low serotonin metabolites in cerebrospinal fluid (5-HIAA) and subsequent suicide attempts, r = .16, N = 40 (Lester, 1995).
29. Personality tests and conceptually meaningful job performance criteria, r = .16, N = 1,101 (data combined from Robertson & Kinder, 1993; Tett, Jackson, & Rothstein, 1991; and Tett, Jackson, Rothstein, & Reddon, 1994; we used the single scale predictors from Robertson & Kinder [their Table 3] and the confirmatory results from Table 1 in Tett et al., 1994).
30. Implicit memory tests and differentiation of normal cognitive ability from dementia, r = .16, N = 1,156 (Meiran & Jelicic, 1995).
31. MMPI Cook-Medley hostility scale elevations and subsequent death from all causes, r = .16, N = 4,747 (Miller, Smith, Turner, Guijarro, & Hallet, 1996; data were drawn from their Table 6).
32. Motivation to manage from the Miner Sentence Completion Test and managerial effectiveness, r = .17, N = 2,151 (Carson & Gilliard, 1993; results were averaged across the three performance criterion measures of managerial success. Because the three criterion measures were not independent across studies, the N reported is the largest N used for any single criterion).
33. Extraversion and subjective well-being, r = .17, N = 10,364 (DeNeve & Cooper, 1998).
34. MRI T2 hyperintensities and differentiation of affective disorder patients from healthy controls, r = .17, N = 1,575 (Videbech, 1997; data from Videbech's Tables 1 and 2 were combined, but only those statistics used by the original author are included here).
35. Test anxiety scales and lower school grades, r = .17, N = 5,750 (Hembree, 1988; reported is the average effect size for the course grade and GPA data from Hembree's Table 1. Participants were assumed to be independent across studies).
36.
High trait anger assessed in an interpersonal analogue and elevated blood pressure, r = .18, k = 34 (Jorgensen, Johnson, Kolodziej, & Schreer, 1996; data come from the "Overall" column of their Table 4).
37. Reduced blood flow and subsequent thrombosis or failure of synthetic hemodialysis graft, r = .18, N = 4,569 (Paulson, Ram, Birk, & Work, 1999).
38. MMPI validity scales and detection of known or suspected underreported psychopathology, r = .18, N = 328 (Baer, Wetter, & Berry, 1992; weighted average effect size was calculated from data reported in their table for all studies using participants presumed to be underreporting).
39. Dexamethasone suppression test scores and subsequent suicide, r = .19, N = 626 (Lester, 1992).
40. Short-term memory tests and subsequent job performance, r = .19, N = 1,774 (Verive & McDaniel, 1996).
41. Depression test scores and subsequent recurrence of herpes simplex virus symptoms, r = .20, N = 333 (Zorrilla, McKay, Luborsky, & Schmidt, 1996; effect size is for prospective studies).
42. Four preoperative cardiac tests and prediction of death or MI within 1 week of vascular surgery, r = .20, N = 1,991 (Mantha et al., 1994; the four tests considered were dipyridamole-thallium scintigraphy, ejection fraction estimation by radionuclide ventriculography, ambulatory ECG, and dobutamine stress ECG. The authors concluded no test was conclusively superior to the others).
43. Scholastic Aptitude Test scores and subsequent college GPA, r = .20, N = 3,816 (Baron & Norman, 1992).
44. Self-reported dependency test scores and physical illness, r = .21, N = 1,034 (Bornstein, 1998; weighted effect size was calculated from the retrospective studies reported in Bornstein's Table 1 [Studies 3, 5, 7, 8, 13, and 19] and the prospective studies listed in Bornstein's Table 2).
45. Dexamethasone suppression test scores and psychotic vs. nonpsychotic major depression, r = .22, N = 984 (Nelson & Davis, 1997; effect size calculated from the weighted effects for the individual studies in their Table 1).
46. Traditional ECG stress test results and coronary artery disease, r = .22, N = 5,431 (Fleischmann, Hunink, Kuntz, & Douglas, 1998; results were estimated from the reported sensitivity and specificity in conjunction with the base rate of coronary artery disease and the total independent N across studies).
47. Graduate Record Exam Quantitative scores and subsequent graduate GPA, r = .22, N = 5,186 (Morrison & Morrison, 1995).
48. TAT scores of achievement motivation and spontaneous achievement behavior, r = .22, k = 82 (Spangler, 1992; coefficient was derived from the weighted average of the semioperant and operant criterion data in Spangler's Table 2).
49. Isometric strength test scores and job ratings of physical ability, r = .23, N = 1,364 (Blakley, Quiñones, & Crawford, 1994).
50. Single serum progesterone testing and diagnosis of ectopic pregnancy, r = .23, N = 6,742 (Mol, Lijmer, Ankum, van der Veen, & Bossuyt, 1998; following the original authors, we used only the 18 prospective or retrospective cohort studies listed in their Table 1).
51. Cognitive multitask performance test scores and subsequent pilot proficiency, r = .23, N = 6,920 (Damos, 1993).
52. WISC distractibility subscales and learning disability diagnoses, r = .24, k = 54 (Kavale & Forness, 1984; the effect sizes from this meta-analysis are likely to be underestimates because the authors computed the average effect for individual test scores rather than the effect for a composite pattern).
53. Fetal fibronectin testing and prediction of preterm delivery, r = .24, N = 7,900 (Faron, Boulvain, Irion, Bernard, & Fraser, 1998; data were aggregated across low- and high-risk populations and across designs with single or repeated testing for all studies using delivery before 37 weeks as the criterion).
54. Decreased bone mineral density and lifetime risk of hip fracture in women, r = .25, N = 20,849 (Marshall, Johnell, & Wedel, 1996; the results were restricted to those from absorptiometry using single or dual energy, photon, or X-ray; quantitative CT; quantitative MRI; or ultrasound scanning. The overall effect was estimated from their Table 3 using a total lifetime incidence of 15%; the effect would be smaller if the lifetime risk incidence was lower [e.g., if the incidence were 3%, the effect would be r = .13]. Total N was derived from the n for each study in their Table 1 reporting the incidence of hip fractures).
55. General intelligence test scores and functional effectiveness across jobs, r = .25, N = 40,230 (Schmitt, Gooding, Noe, & Kirsch, 1984; data were obtained from their Table 4).
56. Internal locus of control and subjective well-being, r = .25, N = 8,481 (DeNeve & Cooper, 1998).
57. Integrity test scores and subsequent supervisory ratings of job performance, r = .25, N = 7,550 (Ones, Viswesvaran, & Schmidt, 1993; effect size was taken from the "predictive-applicant" cell of their Table 8).
58. Self-reported dependency test scores and dependent behavior, r = .26, N = 3,013 (Bornstein, 1999; coefficient was derived from all results listed in Bornstein's Table 1 as reported in his footnote 6).
59. Self-efficacy appraisals and health-related treatment outcomes, r = .26, N = 3,527 (Holden, 1991).
60. Elevated Jenkins Activity Survey scores and heart rate and blood pressure reactivity, r = .26, k = [?] (Lyness, 1993; the effect size reflects the average reactivity for heart rate, systolic blood pressure, and diastolic blood pressure as reported in Lyness's Table 6. It was assumed that overlapping studies contributed to each of these criterion estimates, so k was estimated as the largest number of effect sizes contributing to a single criterion measure).
61. Combined internal, stable, and global attributions for negative event outcomes and depression, r = .2[?], N = 5,788 (Sweeney, Anderson, & Bailey, 1986; only the finding that dealt with the composite measure of attributions and negative outcomes was included. Coefficients were lower for positive outcomes and for single types of attributions [e.g., internal]).
62. Neuroticism and decreased subjective well-being, r = .27, N = 9,777 (DeNeve & Cooper, 1998).
63. Screening mammogram results and detection of breast cancer within 2 years, r = .27, N = 192,009 (Mushlin, Kouides, & Shapiro, 1998).
64. Microbiologic blood culture tests to detect bloodstream infection from vascular catheters, r = .28, N = 1,354 (Siegman-Igra et al., 1997; only results from studies without criterion contamination were summarized [see Siegman-Igra et al., 1997, pp. 933-934]).
65. C-reactive protein test results and diagnosis of acute appendicitis, r = .28, N = 3,338 (Hallan & Asberg, 1997; mean weighted effect size was derived from data in their Table 1, excluding two studies that did not use histology as the validating criterion and one study that did not report the prevalence of appendicitis).
66. Graduate Record Exam Verbal scores and subsequent graduate GPA, r = .28, N = 5,186 (Morrison & Morrison, 1995).
67. Hare Psychopathy Checklist scores and subsequent criminal recidivism, r = .28, N = 1,605 (Salekin, Rogers, & Sewell, 1996; only effects for predictive studies were summarized).
68. Short-term memory tests and subsequent performance on job training, r = .2[?], N = 16,521 (Verive & McDaniel, 1996).
69. Cranial ultrasound results in preterm infants and subsequent developmental disabilities, r = .29, N = 1,604 (Ng & Dear, 1990).
70. Serum CA-125 testing and detection of endometriosis, r = .29, N = 2,811 (Mol, Bayram, et al., 1998).
71. Neuropsychological test scores and differentiation of patients with multiple sclerosis, r = .29, k = 322 (Wishart & Sharpe, 1997).
72. For women, ECG stress test results and detection of coronary artery disease, r = .30, N = 3,872 (Kwok, Kim, Grady, Segal, & Redberg, 1999; our N was obtained from their Table 1. It differs from the N reported by the authors [3,879 vs.
3,721], though it is not clear what would account for this difference. Although the article also examined the thallium stress test and the exercise ECG, there was not sufficient data for us to generate effect sizes for these measures).
73. YASR total problems and psychiatric referral status (receiving treatment vs. not), r = .30, N = 142 (Achenbach, 1997; effect size was estimated from data in Part 1 of Achenbach's Table 7.5. Because the percentages listed in this table were too imprecise to accurately generate effect size estimates, all possible 2 × 2 tables that would match the given percentages were generated. Subsequently, the effect size was obtained from those 2 × 2 tables that also produced odds ratios that exactly matched the odds ratios reported in the text. When rounded to two decimal places, all appropriate 2 × 2 tables produced the same effect size. The effect size compares the self-reports of young adults in treatment with the self-reports of demographically matched controls who were not receiving treatment).
74. Fecal leukocyte results and detection of acute infectious diarrhea, r = .30, N = 7,132 (Huicho, Campos, Rivera, & Guerrant, 1996; results are reported for the most studied test [k = 19]. For the remaining tests, effect sizes could be generated for only two small studies of fecal lactoferrin, and the average results for occult blood tests were lower [r = .26; k = 7]).
75. Neuropsychological test scores and differentiation of learning disabilities, r = .30, k = 394 (Kavale & Nye, 1985; we report the results for neuropsychological functioning because it was studied most frequently).
76. Continuous performance test scores and differentiation of ADHD and control children, r = .31, N = 720 (Losier, McGrath, & Klein, 1996; overall sample weighted effect was derived by combining the omission and commission data reported in their Tables 7 and 8).
77.
Effects of psychological assessment feedback on subsequent patient well-being, r = .31, N = 120 (coefficient combined the follow-up data reported in Finn & Tonsager, 1992, and Newman & Greenway, 1997).
78. Expressed emotion on the CFI and subsequent relapse in schizophrenia and mood disorders, r = .32, N = 1,737 (Butzlaff & Hooley, 1998).
79. CT results and detection of aortic injury, r = .32, N = 3,579 (Mirvis, Shanmuganathan, Miller, White, & Turney, 1996; from the information provided, an effect size could not be computed for two studies included in this meta-analysis).
80. Screening mammogram results and detection of breast cancer within 1 year, r = .32, N = 263,359 (Mushlin, Kouides, & Shapiro, 1998; overall effect size includes studies that combined mammography with clinical breast examination).
81. Halstead-Reitan Neuropsychological Tests and differentiation of impaired vs. control children, r = .33, N = 858 (Forster & Leckliter, 1994; the reported weighted effect size is slightly inflated because some observations were based on group differences relative to the control group standard deviation [rather than the pooled standard deviation]. When possible, effect sizes were computed directly from the data reported in their Tables 1 and 2. The reported N indicates the total number of independent observations across studies).
82. CT results for enlarged ventricular volume and differentiation of schizophrenia from controls, r = .33, k = 53 (Raz & Raz, 1990).
83. Long-term memory test scores and diagnosis of multiple sclerosis, r = .33, k = 33 (Thornton & Raz, 1997; effect size was obtained from their table with the outlier study excluded).
84.
Hare Psychopathy Checklist scores and subsequent violent behavior, r = .33, N = 1,567 (Salekin, Rogers, & Sewell, 1996; only effects for predictive studies were summarized).
85. Alanine aminotransferase results and detection of improved liver function in hepatitis C patients, r = .34, N = 480 (Bonis, Ioannidis, Cappelleri, Kaplan, & Lau, 1997; data reflect the criterion of any histologically identified improvement).
86. Rorschach scores and conceptually meaningful criterion measures, r = .35, k = 122 (data combined from Atkinson, 1986, Table 1 [k = 79]; Hiller, Rosenthal, Bornstein, Berry, & Brunell-Neuleib, 1999, Table 4 [k = 30]; and K. P. Parker, Hanson, & Hunsley, 1988, Table 2 [k = 14]. Hiller et al. expressed concern that Atkinson's and K. P. Parker et al.'s effect size estimates may have been inflated by some results derived from unfocused F tests [i.e., with >1 df in the numerator]. However, Atkinson excluded effects based on F, and K. P. Parker et al.'s average effect size actually increased when F test results were excluded. Recently, Garb, Florio, & Grove, 1998, conducted reanalyses of K. P. Parker et al.'s data. Although these reanalyses have been criticized [see K. P. Parker, Hunsley, & Hanson, 1999], if the results from Garb et al.'s first, second, or third analysis were used in lieu of those from K. P. Parker et al., the synthesized results reported here would change by -.00[?], -.0036, or -.0007, respectively, for the Rorschach and by .0203, .0288, or .0288, respectively, for the MMPI [see Entry 100, this table]).
87. Papanicolaou Test (Pap smear) and detection of cervical abnormalities, r = .36, N = 17,421 (Fahey, Irwig, & Macaskill, 1995; overall weighted effect calculated from data reported in their Appendix 1).
88. Conventional dental X-rays and diagnosis of biting surface cavities (occlusal caries), r = .36, N = 5,466 (Ie & Verdonschot, 1994; the overall weighted effect was derived from all the studies listed in their Table 1.
In each case, the original citations were obtained, and raw elfect sizes were calculated from the intial study) 89. Incremental contribution of Rorschach PRS scores over IQ to predict psychotherapy outcome 36 290 (Meyer, 2000), 90. Rorschach or Appercepiive Test Dependency scores and physical illness (Bornstein, 1998; 36 325 weighted effect size was calculated from the reirospective studies reporied in Bornstein's Table | (Sudies 1,11, 14-16, and 78]. No prospective suces used hese ype of sees 0 predictors). 91. Assessment center evaluations and job success [dota combined from Schmit, Gooding, Noe, 37 15,345 & Kirsch, 1984; ond Gougler, Rosenthal, Thornton, & Bentson, 1987; the overall effect size ‘was derived from the sample weighted average reported in each sivdy. Although Schmit et als study was conducted earlier than Gaugler et al.'s, they relied on a larger N. Because teach meto-nalysis undoubtedly relied on some common studies, the N reported here is from Schmit ot ol). 92. Competency screening sentonce-completion test scores and defendant competency 7 ou {Nicholson & Kugler, 191} 93. MCMIcII scale score and average obilty 1o detect depressive or psychotic disorders 7 575 {Ganellen, 1996; each individual study contuted one ele! size averaged ccross Glognosic citer ond type of predictor scales [single vs. mulplescoles). Results were ‘veraged across onalysee reported in diferent publications using she some sample. Alhough Sonelon ropored larger sac sizes fr studies that used mullscole predictors, hese sucies reliad on vnroplicoted mubivarial predilor equations. At such, mulscole predictors were {veraged with hypothesized, singlescalo predictor] = 94. MMPLscole scores ond average obiliy to detect depresive or psychotic dlsorders 37 927 [Conan 96; on Ey 98. ee rte oon ar 8 95. Rorschach Appetceplve Tes! Dependency scores ond dependent behavior (Bornstein, 1999; ; geet ser derived from of resus listed in Borsteins Table 1a reported in his fotnote 8). 96. 
Accuracy of home pregnancy tet kts in patients conditing testing at home (Bastion, 28 155 Nanda, Hesseibled, & Simol, 1998, ress derived from the pooled “ellectveness score,” Which wos described and this irecied os equivalent to Cohen's d. Also, findings were very Giferen when testy vere evlucted vsing researcher assisted volunieors rather than octal patients (7> 81; N= 463} 140 February 2001 + American Psychologist Table 2 (continued) Predictor and criti faedy ad noes) 97. 98. 99, 100. 101 102. 103. 104. 108. 107. 108. 109. 110, Mm 12. Sperm penetration essoy resis ond success wih in vito frlization (Ml, Meier, eal, 1998) Endovaginel vlrasourd in posimenopausel women and detection of endometicl concer [SeinBncman 1998; oferta was dorved rm to lor pooked cai (her ‘able 2] vsing their recommended cutoff of 8 mm to define endometrial thickening). MPI lids sceles and detection of underreported paychopathalogy [primary analogue Nales Baer Wetter, 6 Berry, 1992; weighted overage effec size colculaed from dota in thei Tobe Ti NMP cores ond conceptully meaningful citerion measures (dato combined from Alkinson, 1986, Table 1; Hiler, Rosenthal, Bornstein, Berry, & BrunelLNeviei, 1999, Table 4; and K:P- Parker, Hanson’ & Hunsey. 1988, Toble 2. See also Enty 86, this lable) Neviopsyehologs lstbased jadgmenls and presence/absence of impairment (Gorb & Schrorke, 1996; coefcient was calculated from the asevracy of judgments elatve lo bose rates [see Gorb & Schrame, 1996, pp. 143, 144~148}} Prosiospectic ontigen ond estimoted detection of prosioie cancer for men aged 60-70 (Aziz & Borothur, 1993) Sertorm verbal learning and diferentiaton of major depression from controls (Viel, 1997, clfhough the cuthor reported many effect sizes, we repor the variable that was studied most shen). 
Ci result and detection of lymph node metastases in cervical cancer [Scheidler, Hricok, Yu, Subok, & Segal, 1997, an effet size could not be computed for one study included in his mmetoanalysis) Disoctative Experiences Scale scores and detection of MPD or PTSD vs, controls (Von Uzendoorn & Schuengel, 1996; we assumed the Ns for boh criterion diagnoses were not inlpwder so he epedN thet rhe geome Colposcopy ond detecion of notmal/lowegrade SIL vs, highgrade SIL/cancer ofthe cervix (Mitchell Schotlenfeld, TortoleroLuna, Cantor, & RichardsKortum, 1998; elect sizes were Coleuloed from dota reported in their Tobe 3). Cmca! tuber count on Mil and degree of impaired cognitive development in tuberous Sclerosis (M, Goodman et ol, 1997] Conventional devel Xrays end diognoss of betweentooth cavities (approximal cris; Van Rom 8 Verdonschc, 1993 hs on weighed fc ie fr al cis! wed "rong" valiyerilerion (ce, mictradiogrophy, Hetology ot cviy preparation} Cordigc Huoroscopy and dhognosis of coronary artery diseave[Glonross, Deron clo, Fealhey 1990) amare Serum chlamydie entivody levels and detection of ferilty problems due to tubal patholo (att 1997; ony the rerl for he optimal predictor essays ond optinelcheion Imoosures ae presened) Rorschach PRS scores and subsequent psychotherapy outcome (Meyer & Handler, 1997, 2000} Bigitely enhanced dental Xrays and diagnoss of biting surfaces coviles le & Verdonschot 1994; he overall weighted efect size was Gerved from all ho suds listed in their Toble 1. in each ease, he original citations wore obtained, and raw elect sizes were colcuated from the intial sd), 3. WAIS I@ cand obtained level of education (Hanson, Hunsley, & Porker, 1988). 4. MMPI Valdty scales end detection of known o” suspected molingered psychopathology 994; 165. 6. v7. {dot conbined rom Bey, Bor, & Haris, 191; ond Roger, Sawa, & Salkin, 1¢ overage weighted effec size wos calculated from dota presented in Tables | ond 2 of Berry et ol and Table 1 of Rogers etal. 
for participants presumed or judged to be malingering disturbance}. Dedimer blood test results ond detection of deep vein thrombosis or pulmonary embolism (Becker, Pilbrick, Bachhuber, & Humphries, 1996; results are reported for only the 13 [of 29] studies with stronger methodology). Exercise SPECT imaging and identification of coronary artery disease (Fleischmann, Hunink, Kuntz, & Douglos, 1998; results were estimated from the reported sensitivity and specificity in corunaion wih he base rte of coronary ary sees and the otal independent N ‘across studies) ‘Antineutrophil cytoplasmic antibody testing end detection of Wegener's granulomatosis (Rao ral, 1995, senstiy Tor each sedy wos estimated rom thir igure 1 ' 39 39 39 39 40 40 al 4l Al 42 43 43 a3 Ad Ad 45 4S 46 47 5 1,335 3/443 2,297 (k= 138) 2,235 4,200 (k= 10) 1,022 1,705 2,249 157 (k= 8) 3,765 2,131 783 2,870 k=9) 7m” 1,652 3,237 13,562 (Yoble continues} February 2001 + American Psychologist Mi Table 2 (continued) Predictor and citron oid ord ete) N ne, 119. 120. 121 122 123, 124, 125, 126, 127. 128. 129. 130, 131 132 133, 134, 136. 136. Technetium bone seanning results and detection of osteomyelitis (bone infection; litenberg, Mushin, & the Diagnostic Technology Assessmen! Consortium, 1992) he nical examination with routine lab tess ond detection of metastatic ling cancer [Ses tng Coles, 995). Lecithin/ sphingomyelin roto end prediction of neonatal respiratory distress syndrome forse Smith, Qkorodudv, & Bissell, 1996; he most Fequenly sucled predictor Test was reported) Sensitivity of total serum cholesterol levels fo changes in story cholesterol (Howell, MeNmaro, Tosca, Smith, & Gaines, 1997) Memory recall exts and diferntiotion of schizophrenia from contols (Aleman, Hiiman, de Hoon, & Kohn, 1999; effect size is for skdes wih demographically nalehed comparison paricipents) BCL paren! 
report of total problems and psychiatric refer stots (receiving keciment vs not Achenbach, 1991; raw data fo generofe this eflect size were cbicined kom Themes M. Achenbach {personal communication, February 5, 1999], Coeficient comperes parent ‘alings of eilren in Weafment o parent ratings of demographically matched conta children not receiving reatment, \WAIS IG subvests ond diferentation of demontio rom contol (H. Christensen & Mockinnon, 1992; eect computed from dota presented in thei Tables | ang 2. The reported N's forthe largest somple ocross the ineividel soles! comparisons) Single scum progesterone testing and diagnosis of any nonvigble pregnancy (Mol, Limer, co, 1998; flloning the ongiol thors we used only he 10 proseacve cohen sche fised'n ther Table Ih MRI resuls ond detection of eptured scone gel breast implons (C. M. Goodman, Cohen, Thomby, & Nolscher, 1998; these authors found that mammography [= 21, N- 381] gee lesoud (42, N.— Sa were es facing on Ml) Association of Hachindk ischemic scores wth postmortem cossifizaton of dementia type (Moroney eo, 1997; elect size computed from ther Figure 1 using Coninuovs scores ond the Alzheimer's, mixed, ond mulinfare! group classifications on @ cotinuum), MRI resus and detection of lymph node metasioses in cervical cancer (Scholar, Hricak, Yu, Sobok, & Segal, 1997; on elec size could not be computed for one sudy thluded in this metoanalysi Cognitive fests of information-processing speed and reasoning ability [Verhaeghen & Sathouse, 1997) * "y Merhows MRI cess ond diferentiaion of dementia from controls (Zakzanis, 1998; PET and SPECT findings from this metaanalysis were sighly less valid or bsed on smaller samples, co oe nol reported, Nevropsychological findings wore not used because b, Chistensan, rica Povlovic, & Jacomb, 1991, reported a more extensive meto-onalyis WAIS IQ scores and conceptually meaninghul ciferion measutes (K.P. 
Parke, Henson, & Hundley, 1986, Table 2; Hiller, Rosenthal, Bornstein, Berry, & BrunelLNevleib, 1999, ‘expressed concem about K.P. Parker et c's resus because some effec sizes came from Unlocused Frese, > oFin the numerote), ough the overall fect increases when these resus are exclodd) Exercise ECG resus ond Idenification of coronary artery disease (leschmann, Hung, Kints, & Dovglos, 1998; results were estimated from the reported senstvily and specciy inconuneson wih he bow eof coonary rey dee and he a dependent N cross suche). Utrasound results and identification of deop venovs thrombosis (Wells, Lensing, Davidson, Pin ah 1998 : “ocdizatono - NeuropsychologistTestbosed judgments and presence localization of impoirment (Ga Schromie, 1996; effect size calculated from the accuracy of ladgmentsrlenve fo bose rates [see Garb & Schramke, 1996, op. 143, 144-145} longterm verbal memory lesis ond dflereniion of dementia from depression (H Ghrstensen, Grif, MacKinnon, & Jocomb, 1997; eet data taken fom ther Table 4) CTresults ond detection of metastases from head and nock cancer (Meri, Wilioms, fomes, & Porvbtky, 1997; N was obtained from the original studies) 48 50 50 50 51 52 52 53 55 55 S7 57 58 60 61 255 1,593 1,170 (k= 307) 2,290 4220 516 3,804 382 312 817 4,026 374 (k= 39) 2,637 1,616 1,606 (k = 32) 317 142 February 2001 + American Psychologist Table 2 (continued) Fedral ony rd 5 oN (k= 94) 137. Neuropsychological tests ond dliferentiation of dementia from controls [D. Christensen, 68 HodziPavlovie, & Jacomb, 1991; the effect size was derived from studies explicitly stating that dementio had been diagnosed independent ofthe neuropsychological test resus [ D. Christensen et o., 1991, p. 150}. 138. Immunoglobulin antiperinvclear factor scores and detection of rheumatoid arthritis, 68 2,541 (Berthelot, Garni, Glemarec, & lipo, 1998} 139. 
MMPI Validity scales and detection of malingered psychopathology (primarily analogue studies; data combined from Berry, Baer, & Harris, 1991; and Rogers, Sewell, & Salekin, 1994; average weighted effect size calculated from Tables 1 and 2 of Berry et al. and Table 1 of Rogers et al.). r = .74, N = 11,204
140. MMPI basic scales: booklet vs. computerized form (Finger & Ones, 1999; the alternate forms reliability coefficients for each scale were weighted by sample size [ns from 508 to 872], and the average N is reported). r = .78, N = 732
141. Thoracic impedance scores and criterion measures of cardiac stroke volume and output (Fuller, 1992; only data from methodologically "adequate" studies were included. The mean weighted correlation for each criterion measure was weighted by the number of studies contributing to the mean and then averaged across all criterion measures. Because Fuller [1992, p. 105] cryptically stated that studies were excluded unless there was "concurrence of measurement between the two instruments being compared," it is possible that relevant studies were omitted when the findings did not support the hypothesis). r = .81, k = 24
142. Creatinine clearance test results and kidney function (glomerular filtration rate; Campens & Buntinx, 1997; results for measured and estimated [by the Cockcroft-Gault formula] creatinine clearance were pooled. The N reported in our table is slightly inflated because it was impossible to identify the specific n for two of the studies that used both measures). r = .83, N = 2,459
143. Duplex ultrasonography results and identification of peripheral artery disease (de Vries, Hunink, & Polak, 1996; weighted effect size derived from data in their Table 2 using patient samples. The reported N refers to the number of observations; some patients were counted multiple times). r = .83, N = 4,906
144. Finger or ear pulse oximetry readings in patients and arterial oxygen saturation (L. A. Jensen, Onyskiw, & Prasad, 1998). r = .84, N = 4,354

Note. ADHD = attention-deficit hyperactivity disorder; CBCL = Child Behavior Checklist; CFI = Camberwell Family Interview; CT = computed tomography; ECG = electrocardiogram; GPA = grade point average; IQ = intelligence quotient; k = number of effect sizes contributing to the mean estimate; K = number of studies contributing to the mean estimate; MCMI-II = Millon Clinical Multiaxial Inventory-2nd Edition; MMPI = Minnesota Multiphasic Personality Inventory; MPD = multiple personality disorder; MRI = magnetic resonance imaging; PET = positron emission tomography; PRS = Prognostic Rating Scale; PTSD = posttraumatic stress disorder; SIL = squamous intraepithelial lesions; SPECT = single photon emission computed tomography; TAT = Thematic Apperception Test; WAIS = Wechsler Adult Intelligence Scale; WISC = Wechsler Intelligence Scale for Children; YASR = Young Adult Self-Report.
ᵃ The actual effect was a statistically nonsignificant value of -.013 (i.e., in the direction opposite of prediction). ᵇ Triple marker refers to the joint use of alpha-fetoprotein, human chorionic gonadotropin, and unconjugated estriol. ᶜ These results are not from meta-analyses and were not identified through our systematic literature search.

Distinctions Between Psychological Testing and Psychological Assessment

Psychological testing is a relatively straightforward process wherein a particular scale is administered to obtain a specific score. Subsequently, a descriptive meaning can be applied to the score on the basis of normative, nomothetic⁹ findings. In contrast, psychological assessment is concerned with the clinician who takes a variety of test scores, generally obtained from multiple test methods, and considers the data in the context of history, referral information, and observed behavior to understand the person being evaluated, to answer the referral questions, and then to communicate findings to the patient, his or her significant others, and referral sources.

In psychological testing, the nomothetic meaning associated with a scaled score of 10 on the Arithmetic subtest from the Wechsler Adult Intelligence Scale—Third Edition (Wechsler, 1997) is that a person possesses average skills in mental calculations.
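The nomothetic reading of a scaled score can be made concrete with a small calculation. Wechsler subtest scaled scores are normed to a mean of 10 and a standard deviation of 3; the sketch below additionally assumes the norm group is normally distributed, which is a simplification (published norm tables, not a formula, are what a clinician would actually consult). The function name is hypothetical and chosen only for illustration.

```python
import math


def scaled_score_to_percentile(score, mean=10.0, sd=3.0):
    """Approximate percentile rank of a scaled score.

    Assumes a normal norm distribution (an illustrative simplification);
    the defaults are the Wechsler subtest scaling of mean 10, SD 3.
    """
    z = (score - mean) / sd
    # Standard normal cumulative distribution, expressed via the error function.
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


# A scaled score of 10 sits at the middle of the norm group.
print(round(scaled_score_to_percentile(10), 1))  # 50.0
```

The percentile is the same for every person who obtains the score, which is exactly why it carries only the normative meaning and none of the individual context.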
In an idiographic assessment, the same score may have very different meanings. After considering all relevant information, this score may mean a patient with a recent head injury has had a precipitous decline in auditory attention span and the capacity to mentally manipulate information. In a patient undergoing cognitive remediation for attentional problems secondary to a head injury, the same score may mean there has been a substantial recovery of cognitive functioning. In a third, otherwise very intelligent patient, a score of 10 may mean pronounced symptoms of anxiety and depression are impairing skills in active concentration. Thus, and consistent with Shea's (1985) observation that no clinical question can be answered solely by a test score, many different conditions can lead to an identical score on a particular test. The assessment task is to use test-derived sources of information in combination with historical data, presenting complaints, observations, interview results, and information from third parties to disentangle the competing possibilities (Eyde et al., 1993). The process is far from simple and requires a high degree of skill and sophistication to be implemented properly.

⁹ Nomothetic refers to general laws or principles. Nomothetic research typically studies the relationship among a limited number of characteristics across a large number of people. Idiographic refers to the intensive study of a single individual. Here, the focus is on how a large number of characteristics fit together uniquely within one person or in the context of a single life.

Distinctions Between Formal Assessment and Other Sources of Clinical Information

All mental health professionals assess patient problems. Almost universally, such evaluations rely on unstructured interviews and informal observations as the key sources of information about the patient.
Although these methods can be efficient and effective ways to obtain data, they are also limited. When interviews are unstructured, clinicians may overlook certain areas of functioning and focus more exclusively on presenting complaints. When interviews are highly structured, clinicians can lose the forest for the trees and make precise but errant judgments (Hammond, 1996; Tucker, 1998). Such mistakes may occur when the clinician focuses on responses to specific interview questions (e.g., diagnostic criteria) without fully considering the salience of these responses in the patient's broader life context or without adequately recognizing how the individual responses fit together into a symptomatically coherent pattern (Arkes, 1981; Klein, Ouimette, Kelly, Ferro, & Riso, 1994; Perry, 1992).

Additional confounds derive from patients, who are often poor historians and/or biased presenters of information (see, e.g., John & Robins, 1994; Moffitt et al., 1997; Rogler, Malgady, & Tryon, 1992; Widom & Morris, 1997). For instance, neurologically impaired patients frequently lack awareness of their deficits or personality changes (Lezak, 1995), and response styles such as defensiveness or exaggeration affect the way patients are viewed by clinical interviewers or observers (see, e.g., Alterman et al., 1996; Pogge, Stokes, Frank, Wong, & Harvey, 1997). Defensive patients are seen as more healthy, whereas patients who exaggerate their distress are seen as more impaired. In contrast to less formal clinical methods, psychological testing can identify such biased self-presentation styles (see Entries 38, 99, 114, & 139 in Table 2), leading to a more accurate understanding of the patient's genuine difficulties.

There are several other ways that formal psychological assessment can circumvent problems associated with typical clinical interviews.
First, psychological assessments generally measure a large number of personality, cognitive, or neuropsychological characteristics simultaneously. As a result, they are inclusive and often cover a range of functional domains, many of which might be overlooked during less formal evaluation procedures.

Second, psychological tests provide empirically quantified information, allowing for more precise measurement of patient characteristics than is usually obtained from interviews.

Third, psychological tests have standardized administration and scoring procedures. Because each patient is presented with a uniform stimulus that serves as a common yardstick to measure his or her characteristics, an experienced clinician has enhanced ability to detect subtle behavioral cues that may indicate psychological or neuropsychological complications (see, e.g., Lezak, 1995). Standardization also can reduce legal and ethical problems because it minimizes the prospect that unintended bias may adversely affect the patient. In less formal assessments, standardization is lacking, and the interaction between clinician and patient can vary considerably as a function of many factors.

Fourth, psychological tests are normed, permitting each patient to be compared with a relevant group of peers, which in turn allows the clinician to formulate refined inferences about strengths and limitations. Although clinicians using informal evaluation procedures generate their own internal standards over time, these are less systematic and are more likely to be skewed by the type of patients seen in a particular setting.
Moreover, normed information accurately conveys how typical or unusual the patient is on a given characteristic, which helps clinicians to more adequately consider base rates—the frequency with which certain conditions occur in a setting (see, e.g., Finn & Kamphuis, 1995).

Fifth, research on the reliability and validity of individual test scales sets formal assessment apart from other sources of clinical information. These data allow the astute clinician to understand the strengths or limitations of various scores. Without this, practitioners have little ability to gauge the accuracy of the data they process when making judgments.

The use of test batteries is a final distinguishing feature of formal psychological assessment. In a battery, psychologists generally employ a range of methods to obtain information and cross-check hypotheses. These methods include self-reports, performance tasks, observations, and information derived from behavioral or functional assessment strategies (see Haynes et al., 1997). By incorporating multiple methods, the assessment psychologist is able to efficiently gather a wide range of information to facilitate understanding the patient.

Cross-Method Agreement

Our last point raises a critical issue about the extent to which distinct assessment methods provide unique versus redundant information. To evaluate this issue, Table 3 presents a broad survey of examples. As before, we attempted to draw on meta-analytic reviews or large-scale studies for this table, though this information was not often available.
Consequently, many of the entries represent a new synthesis of relevant literature.¹⁰ To highlight independent methods, we excluded studies that used aggregation strategies to maximize associations (e.g., self-reports correlated with a composite of spouse and peer reports; see Cheek, 1982; Epstein, 1983; Tsujimoto, Hamilton, & Berger, 1990) and ignored moderators of agreement that may have been identified in the literature. We also excluded studies in which cross-method comparisons were not reasonably independent. For instance, we omitted studies in which patients completed a written self-report instrument that was then correlated with the results from a structured interview that asked comparable questions in an oral format (see, e.g., Richter, Werner, Heerlein, Kraus, & Sauer, 1998). However, to provide a wide array of contrasts across different sources, we at times report results that are inflated by criterion contamination.

A review of Table 3 indicates that distinct assessment methods provide unique information. This is evident from the relatively low to moderate associations between independent methods of assessing similar constructs. The findings hold for children and adults and when various types of knowledgeable informants (e.g., self, clinician, parent, peer) are compared with each other or with observed behaviors and task performance. For instance, child and adolescent self-ratings have only moderate correspondence with the ratings of parents (Table 3, Entries 1-4), teachers (Table 3, Entries 8-10), clinicians (Table 3, Entries 5 & 6), or observers (Table 3, Entry 7), and the ratings from each of these sources have only moderate associations with each other (Table 3, Entries 12-18, 20-21).
For adults, self-reports of personality and mood have small to moderate associations with the same characteristics measured by those who are close to the target person (Table 3, Entries 23-25, 29-30), peers (Table 3, Entries 26-28), clinicians (Table 3, Entries 31-34), performance tasks (Table 3, Entries 38-44), or observed behavior (Table 3, Entries 45-47).

The substantial independence between methods clearly extends into the clinical arena. Not only do patients, clinicians, parents, and observers have different views about psychotherapy progress or functioning in treatment (see Table 3, Entries 3, 7, & 31) but diagnoses have only moderate associations when they are derived from self-reports or the reports of parents, significant others, and clinicians (see Table 3, Entries 4, 6, 15, 17, 30, 33, 34, 48, & 49).¹¹

The data in Table 3 have numerous implications, both for the science of psychology and for applied clinical practice. We emphasize just two points. First, at best, any single assessment method provides a partial or incomplete representation of the characteristics it intends to measure. Second, in the world of applied clinical practice, it is not easy to obtain accurate or consensually agreed on information about patients. Both issues are considered in more detail below.

Distinct Methods and the Assessment Battery

A number of authors have described several key features that distinguish assessment methods (see, e.g., Achenbach, 1995; Achenbach, McConaughy, & Howell, 1987; Finn, 1996; McClelland, Koestner, & Weinberger, 1989; Meyer, 1996b, 1997; S. B. Miller, 1987; Moskowitz, 1986; Winter, John, Stewart, Klohnen, & Duncan, 1998).
Under optimal conditions, (a) unstructured interviews elicit information relevant to thematic life narratives, though they are constrained by the range of topics considered and ambiguities inherent when interpreting this information; (b) structured interviews and self-report instruments elicit details concerning patients' conscious understanding of themselves and overtly experienced symptomatology, though they are limited by the patients' motivation to communicate frankly and their ability to make accurate judgments; (c) performance-based personality tests (e.g., Rorschach, TAT) elicit data about behavior in unstructured settings or implicit dynamics and underlying templates of perception and motivation, though they are constrained by task engagement and the nature of the stimulus materials; (d) performance-based cognitive tasks elicit findings about problem solving and functional capacities, though they are limited by motivation, task engagement, and setting; and (e) observer rating scales elicit an informant's perception of the patient, though they are constrained by the parameters of a particular type of relationship (e.g., spouse, coworker, therapist) and the setting in which the observations transpire. These distinctions provide each method with particular strengths for measuring certain qualities, as well as inherent restrictions for measuring the full scope of human functioning.

More than 40 years ago, Campbell and Fiske (1959) noted how relative independence among psychological methods can point to unappreciated complexity in the phenomena under investigation. Thus, though low cross-method correspondence can potentially indicate problems with one or both methods under consideration, correlations can document only what is shared between two variables. As such, cross-method correlations cannot reveal what makes a test distinctive or unique, and they also cannot reveal how good a test is in any specific sense.
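A short numerical sketch may help show why a modest cross-method correlation says little about the value of either method. It uses two standard results: the squared correlation as the proportion of shared variance, and the textbook formula for the multiple correlation of a criterion with two predictors. The input values are hypothetical, chosen only for illustration, and are not taken from Table 2 or Table 3; the function names are likewise illustrative, and this is not the authors' own analysis.

```python
import math


def shared_variance(r):
    """Proportion of variance two measures share, given their correlation r."""
    return r ** 2


def multiple_r_two_predictors(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2: each predictor's correlation with the criterion.
    r12:    correlation between the two predictors.
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)


# Hypothetical values for illustration only.
cross_method_r = 0.20   # agreement between two assessment methods
validity_a = 0.40       # method A's correlation with a criterion
validity_b = 0.40       # method B's correlation with the same criterion

# The two methods share only a small part of their variance...
print(round(shared_variance(cross_method_r), 2))  # 0.04
# ...yet together they relate to the criterion more strongly than either alone.
print(round(multiple_r_two_predictors(validity_a, validity_b, cross_method_r), 2))  # 0.52
```

Under these assumed values, the low agreement between methods coexists with equal validity for each, and the combination outperforms either method used alone.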
Given the intricacy of human functioning and the method distinctions outlined above, psychologists should anticipate disagreements when similarly named scales are compared across diverse assessment methods. Furthermore, given the validity data provided in Table 2, psychologists should view the results in Table 3 as indicating that each assessment method identifies useful data not available from other sources. As is done in other scientific disciplines (Meyer, (text continues on page 150)

¹⁰ For Table 3, we searched PsycINFO using a variety of strategies. We also relied on bibliographic citations from contemporary articles and reviews. Although we undoubtedly overlooked pertinent studies, our search was extensive. The 38 entries in Table 3 integrate data from more than 800 samples and 190,000 participants, and we included all studies that fit within our search parameters. Thus, we are confident the findings are robust and generalizable.

¹¹ Methodologically, agreement between diagnoses derived from self-reports and clinicians is inflated by criterion contamination because clinicians must ground their diagnostic conclusions in the information reported by patients. Similar confounds also likely affect the associations between self-ratings and significant-other ratings.
