RESEARCH METHODS IN PSYCHIATRY

A Checklist for Evaluating the Usefulness of Rating Scales*

David L. Streiner, Ph.D.¹

Rating scales of various sorts are very useful for both clinical and research purposes. However, they vary greatly regarding their reliability, validity, and utility. This article provides a guide for people who need to evaluate scales, either to incorporate them into their own research or clinical activities, or to determine if the results of studies which use scales are meaningful. The different types of reliability and validity are discussed, and guidelines are offered to evaluate how well these were assessed. Finally, other aspects of scales which can affect their usefulness, such as completion time, training, and scoring ease, are discussed.

*Manuscript received July 1991.
¹Departments of Psychiatry and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario.
Address reprint requests to: Dr. David L. Streiner, Department of Clinical Epidemiology and Biostatistics, Faculty of Health Sciences, McMaster University, 1200 Main Street West, Hamilton, Ontario L8N 3Z5.
Can J Psychiatry, Vol. 38, March 1993.

Rating scales are ubiquitous in psychiatry. They are used to measure symptoms which are found in most disorders, such as depressed mood, anxiety, or positive and negative symptoms of schizophrenia; to determine the presence of some trait, such as self-esteem; or for a variety of other purposes (1-5). In many studies, a scale is used as the only criterion to rate the effectiveness of an intervention (6,7). Consequently, we must have some confidence that the scale measures what it purports to and that it does so with a minimum of error. In this article, we will examine some of the characteristics of scales and give some criteria which can be used to evaluate their utility. The intended audience is people who want to be better able to evaluate articles which have used rating scales and those who intend to use existing scales in their research. This paper is not meant to be a primer on how to construct scales; those who are interested in doing so would be better served by reading one of the more complete texts (8-10).

It may be useful to begin by differentiating the term scale from other, similar terms such as index, inventory, test, or schedule. Some authors have tried to distinguish a scale from an index on the basis of the tool's measurement properties: one yields a score which is on an ordinal scale (the intervals between the numbers cannot be assumed to reflect equal increments in the trait being measured), and the other produces a score on an interval scale (that is, the intervals represent equal increases in the trait). Unfortunately, people cannot agree on which term is which, so now the two terms are used interchangeably. Yet a third term, inventory, can also be used as a synonym, although most people use it to refer to an instrument which consists of a number of different independent scales, such as the Minnesota Multiphasic Personality Inventory (MMPI) (11) or the Millon Clinical Multiaxial Inventory (12). A test used to refer to an instrument which a person could pass or fail, such as the final exams ("achievement tests" in more formal parlance) we struggled through in school, or the admission tests we took to get into graduate or professional school. There are, however, exceptions to this usage: the Test of Attentional and Interpersonal Style (13), for instance, would more accurately be called an inventory.
What all of these instruments have in common, despite their differences in terminology, is that they are attempting to measure the amount of some attribute, be it anxiety, depression, or knowledge of cardiac functioning. A schedule, such as the DIS (14) or the SADS (15), is more like a checklist, which counts the number of signs or symptoms present. This is then used to arrive at a diagnosis, based on a certain number of items being endorsed, or on the pattern of endorsement of specific items. Rarely do schedules yield a number which can be manipulated statistically. In this paper, we will focus on scales, since the methods used to develop and validate them differ from those used in the development of schedules. The article itself will centre around a checklist which can be used when you are assessing a scale's potential utility (see Table I).

The Items

Where They Come From

Scales, unlike the goddess Athena, rarely spring full-grown out of their creators' heads. Rather, there is usually a process of gathering potentially useful items from various sources and then winnowing out those which do not meet certain criteria. Items may come from existing scales, clinical observation, expert opinion, patients' reports of their subjective experiences, research findings, and theory. There are advantages and disadvantages associated with each of these sources. The user should be aware of where the scale's items came from and the possible biases which may therefore have been introduced.

Table I
A Checklist for Evaluating Scales

Items
Where did they come from?
  + Previous scales
  + Clinical observation
  + Expert opinion
  + Patients' reports
  + Research findings
  + Theory
Were they assessed for:
  + Endorsement frequency?
  + Restrictions in range?
  + Comprehension?
  + Lack of ambiguity?
  + Lack of value-laden or offensive content?

Reliability
Was the scale checked for internal consistency?
Does it have good test-retest reliability?
Does it have good inter-rater agreement (if appropriate)?
On what groups were the reliabilities estimated?
How was reliability calculated?

Validity
Does the scale have face validity? Should it?
Is there evidence of content validity?
Has it been evaluated for criterion validity (if possible)?
What studies have been done to assess construct validity?
With what groups has the scale been validated?

Utility
Can the scale be completed in a reasonable amount of time?
How much training is needed to learn how to administer it?
Is it easy to score?

Existing Scales

It is sometimes said that borrowing from one source is plagiarism, but taking from two or more is research; the same principle holds true in scale development. Older scales may be deemed inadequate because they do not adequately cover all aspects of an area. However, their items are frequently quite useful, and may have gone through their own process of weeding out bad ones. Moreover, if one were trying to tap depressive mood, for example, it is difficult to ask about appetite loss in a way that hasn't been used before. Consequently, items from earlier scales often appear on more recent ones. However, there is the danger that their terminology may have become outdated in the interim. Until it was modified in 1987, the MMPI contained such quaint terms as "deportment," "cutting up," and "drop the handkerchief." Endorsement of these items revealed more about a person's age than his or her behaviour.
Clinical Observation

The impetus for developing a new scale, either in an area where previous scales exist or in virgin territory, is clinical observation — seeing phenomena which aren't tapped by existing tools. The major problem with this is that the person developing the scale may have "seen" phenomena which aren't apparent to other people. Unfortunately, psychiatry's history is replete with such spurious observations (16). If a scale is based on these unreplicated findings, a later user of the instrument may expend considerable effort executing a study, only to find out in the end that the dependent measure is of little clinical use.

Expert Opinion

A step up from the clinical wisdom of one person is the collective wisdom of a panel of experts. The obvious advantage is that a range of opinions can be expressed, so that the scale is less the result of one idiosyncratic way of thinking or limited by one person's clinical experience. However, there is always the danger that the "panel of experts" was chosen because its members are all known to the senior author of the scale, and they all share his or her viewpoint and biases regarding the domain to be measured. In this case, there is probably more of an illusion of a breadth of opinions than the reality of it. Consequently, just because the items were contributed by a panel, do not automatically assume that they reflect a broader spectrum of ideas.

Patients' Reports

Clinicians may see the behavioural manifestations of a disorder, but only the patient is able to report on the subjective elements. Quite often, it is these aspects that differentiate one disorder from another, or which provide the best clues about severity. Some scales, though, such as the Hamilton Rating Scale for Depression (2), are based on a phenomenological approach to psychiatry, so there is a decreased emphasis on the patient's subjective state. If the patient's experience is important to you, for example, you may wish to look for a different scale of dysphoria. This could only be determined, though, if the scale's author reported this orientation in the original paper.

Research Findings

Empirical research provides another fruitful source of potential items. For example, many of the items on scales which differentiate "process" from "reactive" schizophrenics were based on empirical findings, such as the increased probability of the involvement of alcohol among reactive schizophrenics (17). However, just because one paper reported a finding is no guarantee that it will be replicated (16).

Theory

Most scales used in psychiatry reflect some theoretical approach, although this may not be stated explicitly in the paper describing the scale's development, or even acknowledged by the scale's author. However, the wide disparity in content among various scales of social support is a manifestation of the authors' theories of that concept (for example, whether social support does or does not involve elements of reciprocity, or whether the key ingredients are the number of people in the network, or the frequency of contacts, or both) (18,19). Unless the scale's developer was explicit about the belief system underlying the instrument's development, it may be necessary for the user to compare the items on the scale with his or her own beliefs about what the scale should cover, using the "content validity matrix" described in the section on validity.

Assessment of the Items

Developing a pool of items is a necessary first step in scale construction.
However, it is extremely unlikely that each of the items will perform as intended. Some will not be understood or will be misinterpreted by respondents, and others may not discriminate between people who have the trait and those who do not. Consequently, it is necessary to cut out items which do not meet certain criteria.

Endorsement Frequency

If one item on a depression scale read, "I have cried at least once in my life," it is obvious that the vast majority of people will answer "yes." If we didn't even bother to ask the question, but simply assumed the answer to be yes, no information would be lost. In fact, we would probably have less measurement error by not asking the question, since the response would not be contaminated by random responding, lying, illiteracy, or anything else which might affect the true answer. The moral of the story is that if some items are answered in one direction or another more than 90% or 95% of the time, they may be even worse than useless.

Restriction in Range

A similar problem may exist with questions which allow a range of answers, such as a seven-point scale to rate student performance which goes from "much below average" to "much above average." Often, people avoid the extreme categories (the "end aversion bias"), thus turning it into a five-point scale. Moreover, no supervisor has ever seen a student who was below average, which thereby forces the responses into the two remaining boxes — those above the mid-point, but not including the end. It is obvious that this item's ability to discriminate among students is limited. A test developer should have checked for items which suffer from a high (or low) endorsement frequency, or which show only a restricted range of answers. These questions contribute nothing to a scale except noise, and should have been weeded out. A simple screening along these lines is sketched below.
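As a concrete illustration, the following sketch flags such items. It is not from the original article; the data, item wordings, and thresholds are invented for the example, and the 95% cut-off simply mirrors the rule of thumb above.

    import numpy as np

    def flag_extreme_endorsement(responses, cutoff=0.95):
        """Flag a yes/no item (coded 0/1) answered in one direction
        more often than `cutoff` of the time."""
        rate = np.mean(responses)
        return rate > cutoff or rate < (1 - cutoff)

    def flag_restricted_range(responses, min_categories_used=3):
        """Flag a rating-scale item whose respondents used fewer than
        `min_categories_used` of the available response categories."""
        return np.unique(responses).size < min_categories_used

    # "I have cried at least once in my life": 98 of 100 answer yes
    crying_item = np.array([1] * 98 + [0] * 2)
    print(flag_extreme_endorsement(crying_item))  # True -- worse than useless

    # Seven-point supervisor ratings crowded into categories 5 and 6
    rating_item = np.array([5, 6, 5, 6, 6, 5, 6, 5, 6, 6])
    print(flag_restricted_range(rating_item))     # True -- restricted range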
Comprehension

Unfortunately, health care professionals spend many years unlearning their mother tongue and replacing it with a language called jargon, or medicalese. People talk about "mood"; professionals speak of "affect." Respondents to a scale may either not know these technical terms, or may misinterpret them. Many lay people think the continuum of anxiety is calm, tense, and hypertensive; hypertension is an emotional state, rather than a fancy term for high blood pressure. Furthermore, the reading ability of many people with a high school diploma is closer to a grade eight than a grade 12 level. So, the fact that you or your colleague can understand the items is no guarantee that the average respondent can. What this means is that the scale should have been pre-tested on the target group, to determine if the terms are comprehensible. If there is no assurance that this was done, the potential test user should take this step.

Lack of Ambiguity

A "no" response to the question, "I like my mother," does not necessarily mean that the respondent hates mom. Rather, it may simply reflect the fact that the subject's mother is dead, and the person did not feel free to interpret the statement in the past tense. Similarly, if a questionnaire to assess knowledge of side-effects asks whether or not they were explained by the physician, the person may have difficulty answering if the nurse did all the explaining: yes, the side-effects were described, but no, it wasn't done by the doctor. Items like these are ambiguous, in that it is not always clear to the subject how she or he should answer, nor is it always clear to the investigator what the answer denotes. As with problems in comprehension, such items should have been pre-tested on people similar to the intended users in order to eliminate them; otherwise, this must be done by the scale's user.

Value-Laden or Offensive Content

How would you expect a patient to respond to the item, "I often bother my doctor with trivial complaints"? Nobody wants to admit bothering his or her physician, especially if the complaints are trivial. These terms, as well as a host of others, lead the respondent to answer as much to the tenor of the question as to its content. Words which signal to the respondent which answers are desirable and which are not introduce a definite bias into the questionnaire and should not appear in items.

The original version of the MMPI (20) contained a number of items tapping religious beliefs. Although this could be justified on psychometric grounds, in that many patients exhibit an increase in religiosity just prior to a psychotic break, the items were sometimes seen as an unwarranted intrusion into personal affairs and made an easy target for those opposed to the use of personality testing. The revised version (11) dropped these offending items, increasing the acceptance of the test. The potential user of questionnaires should scan the scales (or preferably, have a lay person do so) to determine if any items appear leading, value-laden, or offensive.

Reliability

Reliability means that "measurements of individuals on different occasions, or by different observers, or by similar or parallel tests, produce the same or similar results" (8). If the subject has not changed, then two different measurements should yield similar scores on the scale. If this has not been demonstrated, then the potential user of the test should stop right there and relegate it to the wastepaper basket. An article which introduces a new scale and does not report its reliability is hiding something, and should be read with a healthy dose of scepticism.

There are a number of different indices of reliability, each of which reflects a different attribute of the test. Not all are necessary for any one scale: an instrument that is self-administered, for example, does not have to demonstrate inter-rater reliability. In this section, we will go through some of the more common forms of reliability and how they should be measured.

Internal Consistency

If a scale is measuring a single attribute, such as self-esteem or anxiety, then all of the items should be tapping it to varying degrees. So, if a person states on one item that he or she is tense, the person should also endorse other items which ask about nervousness, being on edge, and so on. To the extent that items on the scale are not answered in a consistent manner, the instrument is either measuring a number of different things, or the subjects are responding in an inconsistent manner, perhaps because the items are badly worded. Since the early 1950s, the most widely used measure of a scale's internal consistency has been Cronbach's alpha, also referred to as coefficient α (21). Alpha, like all coefficients of reliability, can range between 0.00 and 1.00. Scales with coefficients under 0.70 should usually be avoided. However, α usually increases with the number of items in the scale, so 20-item tests often have values in the 0.90s. A high value of α is necessary, but not sufficient, since it yields the most "optimistic" estimate of a scale's reliability; that is, one that is higher than the other indices discussed below.
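For readers who want to check a reported value themselves, coefficient α can be computed directly from the standard formula (21): α = (k / (k - 1)) × (1 - sum of item variances / variance of the total score). The sketch below is a minimal illustration; the response matrix is simulated data invented for the example.

    import numpy as np

    def cronbach_alpha(items):
        """Coefficient alpha for an (n_subjects x k_items) response matrix:
        alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Five subjects answering three hypothetical anxiety items (1-5 ratings)
    responses = np.array([
        [4, 5, 4],
        [2, 2, 3],
        [5, 5, 5],
        [1, 2, 1],
        [3, 3, 4],
    ])
    print(round(cronbach_alpha(responses), 2))  # 0.96 for these invented data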
Test-Retest Reliability

A scale should be stable over time, assuming that nothing has changed in the interim. If a patient completes a depression inventory today, and again next week, then the results should reflect an equivalent degree of dysphoria (again assuming the person hasn't gotten any better or worse). If the score is different in the absence of change, then it is impossible to know which number is an accurate reflection of the person's status.

The acceptable value of the test-retest reliability coefficient depends on two factors: the interval between the two administrations, and the hypothesized stability of the trait being measured. Unless the test is very long, the retest should not have occurred less than two weeks after the first testing. Otherwise, the person may have been recalling his or her previous answers and have repeated these, rather than having responded to the questions themselves. (We assume that the person can't remember responses to 100 or so items, so longer tests can have a shorter retest interval.) The interval should not have been so long that whatever was being measured may have changed. Within these limits, test-retest reliabilities in the 0.60s are marginal, those in the 0.70s are acceptable, and anything over 0.80 would be considered to be very high.

Inter-Rater Reliability

Some indices, such as the Hamilton Rating Scale for Depression (2), are completed by a rater rather than by the patient. The issue here is that two or more observers, independently evaluating the same person, should come up with similar scores. As with test-retest reliability, a lack of agreement usually means that none of the results can be trusted. Acceptable values for inter-rater reliability are similar to those for test-retest reliability. However, if each rater must interview the patient independently in order to complete the scale (as opposed to the raters assessing the patient simultaneously through a one-way mirror or based on previous observations), then somewhat lower reliability values are expected. The reason is that there are now two sources of error instead of one: the raters are different, and the patients may have changed in the interim or altered their responses for the different evaluators. Even so, the inter-rater reliability should never drop below 0.60, and should ideally be at least in the low 0.80s.

A note of caution is in order for the person who wants to use an observer-based scale. Just because the article reported a high inter-rater reliability does not mean that you will necessarily achieve the same level. Most of these scales require some degree of observer training; if the training is inadequate or inaccurate, then the reliability will suffer accordingly. By the same token, if you do a better job of training than the original authors, inter-rater reliability may actually increase.

Choice of Sample

The reliability of an instrument depends on its ability to discriminate among people. Paradoxical as it may sound, if all of the subjects get the same score on the scale, then its reliability is zero, since it is not discriminating at all. The implication of this is that if a reliability study for an anxiety scale used a mixed bag of people, ranging from mountain climbers and sky divers to agoraphobics and obsessive-compulsives, then the reliability estimate will be quite high. However, if the scale is intended to be used only among patients suffering from anxiety disorders, then it will have much greater difficulty discriminating among these people, who have a more restricted range of anxiety, and the reliability will drop. Conversely, an estimate based on a sample of patients in an anxiety disorders clinic will underestimate the scale's reliability when it is used in a general outpatient clinic.

There are two conclusions to be drawn from this. First, when evaluating an article reporting reliability, make sure its sample is similar to the people you will use it with. Second, if your sample is different, it may be necessary to conduct your own reliability study. However, if the test's standard deviations are known for the original sample and your sample, it is possible to estimate how much the reliability will change (8), as sketched below.
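The adjustment itself is not spelled out in this article. A common approximation, consistent with the treatment in reference 8, assumes that the error variance stays constant across samples, so that r_new = 1 - (1 - r_old)(SD_old^2 / SD_new^2). The sketch below applies it; the numbers are invented for illustration.

    def adjusted_reliability(r_old, sd_old, sd_new):
        """Estimate reliability in a new sample, assuming the error
        variance is unchanged and only the true-score spread differs."""
        return 1 - (1 - r_old) * (sd_old ** 2) / (sd_new ** 2)

    # Reported: r = 0.90 in a heterogeneous sample with SD = 10.
    # In a homogeneous anxiety-clinic sample with SD = 6, expect roughly:
    print(adjusted_reliability(0.90, sd_old=10, sd_new=6))  # about 0.72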
How Was Reliability Calculated?

The most common way to calculate a reliability coefficient is to compute a Pearson correlation coefficient between the two sets of scores: the test and retest scores, or those given by the two observers (22). Although it is the most widely used index, it is not necessarily the best, for two reasons. First, it cannot readily handle the situation where there are more than two observers. The only way it can do so is by correlating rater A versus rater B, A versus C, and B versus C. Unfortunately, this violates one of the basic assumptions of the correlation: that all of the tests are independent. They obviously are not, since each rater's data enter into the calculations twice. Second, the Pearson correlation is sensitive to differences in association, but not in agreement. Figure 1 helps to distinguish between these terms.

[Figure 1. Scores given by rater A (horizontal axis) plotted against scores given by rater B (vertical axis). Line 1 shows perfect association and agreement; line 2 shows perfect association and no agreement.]

In the situation shown by line 1, there is perfect agreement and perfect association between the two raters: whenever rater A assigned a score of 3 to a subject, so did rater B, and similarly for all other ratings. With line 2, there is a systematic bias. Rater B's scores are always two points higher than rater A's, so that whenever A gave a 2, B gave a 4. In this case, the association is perfect (a high score from A is associated with a high one from B, and similarly for low scores), but the agreement is zero; they never give the same score to a subject. The Pearson r in both of these cases is identical: 1.00. The same problem appears in a test-retest situation; if the scores on readministration are systematically higher or lower than at Time 1, Pearson's r would not pick this up. This doesn't mean that the test is unreliable; since people are rank-ordered exactly the same both times, the test still conveys useful information. Since the Pearson correlation coefficient is insensitive to systematic biases between observers or administration times, however, it is somewhat optimistic. It would be preferable if authors reported an intra-class correlation (ICC), which gives a more accurate estimate of reliability if any bias is present (8).
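To make the distinction concrete, the following sketch reproduces the line 2 situation with invented scores: rater B is always two points higher than rater A. Pearson's r is 1.00, but an intra-class correlation computed for absolute agreement (here the standard two-way random-effects, single-rater form, ICC(2,1), built from the usual ANOVA mean squares) is well below it.

    import numpy as np

    def icc_2_1(X):
        """ICC(2,1): two-way random effects, absolute agreement, single rater.
        X is an (n_subjects x k_raters) matrix of scores."""
        X = np.asarray(X, dtype=float)
        n, k = X.shape
        grand = X.mean()
        row_means = X.mean(axis=1)
        col_means = X.mean(axis=0)
        msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # subjects
        msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # raters
        resid = X - row_means[:, None] - col_means[None, :] + grand
        mse = (resid ** 2).sum() / ((n - 1) * (k - 1))        # residual
        return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    rater_a = np.array([1, 2, 3, 4, 5, 6])
    rater_b = rater_a + 2                       # constant two-point bias
    X = np.column_stack([rater_a, rater_b])

    print(np.corrcoef(rater_a, rater_b)[0, 1])  # Pearson r = 1.0
    print(icc_2_1(X))                           # about 0.64: bias is penalized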
Validity

Now that the reader is assured that the test is measuring something reliably, the next question is, "What exactly is it measuring?" Validity addresses what conclusions can legitimately be drawn about people who attain various scores on the scale: for instance, if the test purports to measure social support, are we safe in saying that people who score high on it actually have more social support? As with reliability, there are a number of ways of assessing this, each telling us something different about the scale, and not all of which are appropriate for any given instrument. As we will see, there is one major difference between determining reliability and assessing validity. In the former case, it is often sufficient for the test constructor to have done one study and arrived at a definitive answer. Validity is an ongoing process, and there is rarely a time we can say "enough." The reason is that we can always learn more about the types of conclusions we can draw, and this varies with the sample, the situation, time, and a host of other factors.

Traditionally, psychometricians have talked about the "three Cs" of validity — content, criterion, and construct validity. Although this has given way to a more global conception of validity testing (8), it is still easier to begin thinking about validity within the context of these three terms. To this "holy trinity" is added one other form of validity: face validity.

Face Validity

Face validity addresses a very simple, qualitative question: do the items appear, on the face of it, to be measuring what the scale says it measures? The presence of face validity does not improve the psychometric properties of the index, nor does its lack detract from them. Rather, it is more a matter of acceptability from the respondent's point of view. That is, the patient may be more willing to complete a depression inventory if the items seem to be asking about affect than if they seem irrelevant. An item such as, "My shirt collar seems looser now," may be excellent from a psychometric point of view, since a "yes" response may reflect appetite and weight loss; however, if the respondent feels that such items are silly and unrelated to his or her problems, the remaining questions may be answered randomly or not at all. There may be times when the test user wishes to avoid face validity. If there is concern that the person may deliberately falsify his or her responses by faking bad or faking good in order to achieve some aim (for example, to ensure hospitalization or to get discharged), then it may be wise to avoid scales which have high face validity.

Content Validity

In order to be sure that a scale which purports to measure family functioning is in fact tapping this area, the items should at a minimum meet two criteria: all relevant areas of family functioning are assessed, and the items shouldn't be tapping topics that are unrelated. As with face validity, this is a judgement call by the test developer and user, and is usually not reduced to a single number. The best way to check the content validity of a scale is to construct a matrix, such as the one in Figure 2. Each column represents a domain deemed to be important, and each row is an item. In a scale of family functioning, for example, the domains could be "communication," "role definition," "autonomy," and so forth. To evaluate the utility of an existing scale, these domains should be determined by the user of the test, rather than simply being a reflection of the developer's ideas.

[Figure 2. A content validity matrix: rows are items (item 1 through item 20), columns are domains such as autonomy, with a check mark where an item taps a domain.]

A check mark is placed under the domain tapped by the item. Each item should have a check mark under one (and preferably only one) domain; otherwise, its relevance for the test is questionable. At the same time, each domain should have at least a few check marks under it; all else being equal, the more items in a domain, the more reliably it is assessed (8). Naturally, this technique can be used only with items which have face validity; otherwise, it would be difficult to determine the correct domains they tap. A small sketch of such an audit follows.
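As an illustration of how the matrix check can be carried out systematically (the items and domains below are invented, not taken from any published scale), the following sketch flags items that map to no domain or to several, and domains covered by too few items.

    # Hypothetical item-to-domain assignments for a family-functioning scale
    matrix = {
        "We talk openly about problems":        ["communication"],
        "Everyone knows their household tasks": ["role definition"],
        "Members make their own decisions":     ["autonomy"],
        "We argue about chores and decisions":  ["communication", "role definition"],
        "I enjoy crossword puzzles":            [],   # taps no stated domain
    }
    domains = ["communication", "role definition", "autonomy"]

    # Each item should have exactly one check mark
    for item, tapped in matrix.items():
        if len(tapped) != 1:
            print(f"Questionable item ({len(tapped)} domains): {item!r}")

    # Each domain should have at least a few check marks
    for d in domains:
        n = sum(d in tapped for tapped in matrix.values())
        if n < 2:  # illustrative threshold: few items means low reliability
            print(f"Domain {d!r} covered by only {n} item(s)")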
Criterion Validity

A test which is supposed to tap an attribute such as suicidal potential should correlate with other, accepted measures of the same trait (23). If the scale being evaluated and the "gold standard" test were both administered at the same time, then this is called concurrent criterion validity. In some circumstances, the criterion may only occur some time in the future; the new test is then evaluated by how well it predicts what the criterion score will be. Correlating scores on the suicide potential scale with the number of attempts over the succeeding year would be an example of predictive criterion validity.

Although the distinction between these two types of validity appears relatively minor, consisting only of when the criterion is measured, the difference is actually quite substantial. If a gold standard exists which can be given at the same time as the new test, then the question should arise, "Why develop a new measure at all?" The only reason (except for wanting to immortalize one's name in a test) is that the new measure is better — cheaper, faster, less invasive, or easier to use. In this case, the two tests should correlate quite strongly with one another — 0.80 or above. If the new scale is supposed to be more valid than the old one, then concurrent validity is necessary, but by no means sufficient. In this case, the two measures should not correlate too highly; if the correlation between the new and gold standard tests is 0.70 or over, then the two are tapping very much the same thing and there is little room for the new test to be any better than the original. If the correlation is too low (below 0.30 or so), then the measures are not related at all, meaning that the new scale is measuring something entirely different from the gold standard. In this case, the test developer is looking for a correlation somewhere between 0.30 and 0.70.

If the test developer used predictive validity, this implies that the behaviour or attribute being measured will only occur some time after the test is completed by the subject, and there is no comparable measure that can be given at the same time. This type of validity is used most often in school admission tests (the criterion is whether or not the person will graduate three or four years later) and diagnostic tests, where the criterion is whether or not the person later develops a disorder, or certain results show up on biopsy or autopsy. In this case, the correlation should be quite high: at least 0.60 if the test is used for research purposes, and 0.85 or higher if clinical decisions are made on the basis of the test results. These rules of thumb are summarized in the sketch below.
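The numeric guidelines above can be collected into a single helper. This is only a restatement of the cut-offs in code form; the function name, argument names, and structure are invented for the example.

    def interpret_criterion_validity(r, design, claim="replacement"):
        """Apply the rough cut-offs above to a validity correlation `r`.
        design: 'concurrent' or 'predictive'
        claim:  'replacement' (cheaper/faster substitute for the gold standard)
                or 'improvement' (claimed to be more valid than the old test)."""
        if design == "concurrent":
            if claim == "replacement":
                return "adequate" if r >= 0.80 else "too low for a substitute"
            if r >= 0.70:
                return "too high: little room to improve on the gold standard"
            if r < 0.30:
                return "too low: measuring something else entirely"
            return "plausible range (0.30-0.70) for an improved measure"
        # predictive designs demand more
        if r >= 0.85:
            return "high enough for clinical decisions"
        return "research use only" if r >= 0.60 else "inadequate"

    print(interpret_criterion_validity(0.55, "concurrent", "improvement"))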
Construct Validity

In psychiatry, the usual state of affairs is that either no other test exists which taps the same attribute, or the existing ones are inadequate for one reason or another; consequently, criterion validity is either impossible to establish or insufficient. Moreover, the trait that is being studied is what is referred to in psychology as a "hypothetical construct" (24), or a "construct" for short. Unlike blood pressure or height, for example, which can be easily seen and measured, constructs such as anxiety and intelligence are abstract variables which cannot be directly observed. We do not see anxiety; we note a variety of symptoms, signs, and behaviours, such as sweatiness, tachycardia, impaired concentration, or avoidance of certain situations. What we then do is infer that all of these are caused by an underlying process which we call "anxiety." It is fair to say that most of the variables we deal with in psychiatry, such as "self-esteem," "repression," or "locus of control," as well as many of the diagnostic groupings, are hypothetical constructs.

How would a person go about validating a new test in an area such as somatic preoccupation (SP)? No criterion exists except for clinical judgement, and that itself is flawed, so the scale developer must use construct validity. Since a construct can be thought of as a "mini-theory," the first step is to make some predictions based on the theory. These may include the following: people with a high degree of SP should see their family physicians more frequently than patients who have lower SP scores; high SP people will have more over-the-counter medications in their homes than low SP people; or scores for high SP people should increase following the announcement of a flu epidemic. Each of these predictions can be tested (one is sketched following this section), and every study with a positive outcome strengthens the construct and increases the validity of the scale.

It is obvious that a construct can lead to literally hundreds of hypotheses. Further, not all of these can be (or need to be) tested. As the reviewer of a scale, the questions you have to ask for each study are: was the study a good test of the construct, and did the measure perform as it should have? Then, based on all of the validity studies which have been done, the third question is: have enough studies been done so that the scale can be considered valid (at least until contrary results come along)? The major point is that there is no single study which can establish construct validity; it must be shown over a number of studies, tapping various aspects of the hypothetical construct. Moreover, what the test developer considers to be sufficient proof of validity may not be the same as what the user of the scale deems necessary.
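For example, the first prediction above, that high-SP scorers see their family physicians more often than low-SP scorers, could be tested with an ordinary two-sample comparison. The data below are simulated purely to illustrate the logic; a positive, replicated result of this kind is what gradually builds construct validity.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Simulated yearly physician visits for hypothetical high- and low-SP groups
    high_sp_visits = rng.poisson(lam=6.0, size=40)
    low_sp_visits = rng.poisson(lam=3.0, size=40)

    # One-sided t-test of the construct-based prediction: high SP -> more visits
    t, p = stats.ttest_ind(high_sp_visits, low_sp_visits, alternative="greater")
    print(f"t = {t:.2f}, p = {p:.4f}")  # a positive result supports the construct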
The Validation Sample

One of the major areas about which a test developer and a test user can disagree regarding the adequacy of a study is the choice of subjects on whom the scale was validated. Because of their ready availability, university students have served as a prime source of data. This is true even for scales which are meant to be used in clinical situations, such as the Social Avoidance and Distress Scale (25) or the Ego Identity Scale (26). The problems are twofold: university students may not interpret the questions the same way that the general population does; and, with few exceptions, not many will have the trait to the same degree as does the average person. Thus, they are rarely representative of the intended audience.

Another major shortcoming of some validity studies is that they show only that the test can discriminate between two extreme populations, such as patients with penetrating head wounds versus normal individuals. Tests are rarely needed to distinguish these two groups from each other; rather, they are used clinically where there is a large element of uncertainty, such as differentiating depression from early dementia. Consequently, while it is necessary for a measure to discriminate between extreme groups (and indeed this is one form of construct validity), it is not sufficient; the test must be shown to work with the groups in which it will ultimately be used.

Even when the sample was appropriate, it is important to remember that different tests which tap an area may have been validated on different samples. For instance, both the BDI (1) and the CES-D (27) are depression inventories. However, the former was validated on (and for) hospitalized psychiatric patients, and the latter on community-dwelling normal individuals. The norms for the CES-D would be totally misleading if it were used with patients, while the BDI would be inappropriate in studies of mood disturbances among newly-diagnosed MS patients or with any other non-psychiatric group.

Just as is the case with reliability, it is easy to make a scale look valid if the samples are chosen judiciously. However, if the members bear little resemblance to the groups with whom the test will ultimately be used, the "validity" is more apparent than real. The conclusion is that the user should ensure not only that appropriate validity testing was done, but that meaningful groups of subjects were used.

Utility

The last area a test user must think about is the utility of the measure. Even if it is highly reliable and valid, it may not be feasible if it requires two weeks to train interviewers, takes three or four hours to administer, and requires a $500 computer program to do the scoring.

Completion Time

There is no absolute criterion for deciding if a test takes too long to complete. The answer depends on the setting, the acuity of the disorder, and the context in which the test is given. People who have to take time off work to attend an outpatient clinic may feel burdened by a questionnaire that keeps them there an additional ten minutes, while inpatients may welcome the break from routine offered by sitting in an office for the three hours required to complete the MMPI. Similarly, a patient in acute distress may find it difficult to sustain his or her attention long enough to complete a lengthy questionnaire. It is often the case in a research setting that a number of scales are given to the subjects. The issue then becomes the length of time that the entire battery takes, and the additional time required to complete another scale. As we mentioned earlier, it is usually the rule that longer tests are more reliable and often more valid than shorter ones. But, based on considerations of time, it may be necessary at times to sacrifice some degree of validity for brevity.

Training Time

An important issue in determining the utility of a test is the length of time necessary to master its use. Training may be required in two areas: learning how to administer the test, and learning how to interpret the results. At one extreme, some scales are simply given to the subject and the total checked against norms to determine if the person's score falls above or below some criterion (3).
At the other extreme, extensive training of an interviewer is needed, even if he or she is a mental health care professional, and the interpretation may require a solid background in psychopathology (14).

Tests which are completed by the patient have the advantage that no time is required to learn how to administer them. However, they require the person to be literate, motivated, and able to concentrate. Those which use an interviewer or a rater (2) do not have these problems, but often require training to reach an acceptable level of inter-rater reliability. In some cases, simply reading a manual may suffice, but the requisite training may be so intensive as to involve attendance at a workshop for three to five days (14,15). Moreover, as we mentioned in the second part of this paper, there is no guarantee that any two raters in a new setting will achieve the same inter-rater reliability that was reported by the test developer.

Learning how to interpret a test may again involve nothing more than memorizing what the cut-off point is between normal and abnormal; at the other end of the spectrum, it usually requires a minimum of one year of supervision to master personality inventories like the MMPI. If the test is used solely for research purposes, this may not be an issue, since most often only group data are compared. However, training time may be an important consideration if an instrument is to be used for clinical purposes. The user of the test must determine how much time is necessary to learn how to use the test. This must then be weighed against the reported validity of the test, in comparison with scales which may not be as valid, but which do not require as much training.

Scoring

Scoring some scales may simply consist of adding up the number of endorsed items, as in the sketch below. Other inventories may need either a computer program or dozens of hand-scoring templates. This may be a critical component of a scale's utility, especially if the instrument is used with a large number of people.
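At the simple end of that spectrum, scoring amounts to a few lines; the example below is invented, including its cut-off, and is meant only to show how little work the easiest scales require.

    # Hypothetical five-item yes/no scale: score = number of endorsed items,
    # compared against an illustrative cut-off taken from (invented) norms
    answers = {"item1": True, "item2": False, "item3": True,
               "item4": True, "item5": False}
    CUTOFF = 3  # assumed normative threshold, for illustration only

    score = sum(answers.values())
    print("above cut-off" if score >= CUTOFF else "below cut-off")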
In summary, scales are useful components of most research and clinical programs. However, there is considerable variability from one to another regarding their psychometric properties and usefulness. There are two very helpful sources of information about these properties: the Mental Measurements Yearbook (MMY) (28) and Test Critiques (29). Both of these have new volumes every few years (MMY now has nine volumes, Test Critiques has seven), with reviews of copyrighted scales by experienced psychologists. Other books (30,31) are often more selective in the choice of tests listed, but can provide very useful information. The potential user of a test would do well to invest some time with these and other sources to determine if it meets the minimum standards of reliability and validity, and if it is practical to use in his or her situation.

References

1. Beck AT, Ward CH, Mendelson M, et al. An inventory for measuring depression. Arch Gen Psychiatry 1961; 4: 561-571.
2. Hamilton M. A rating scale for depression. J Neurol Neurosurg Psychiatry 1960; 23: 56-62.
3. Spielberger CD, Gorsuch RL, Lushene RE. Manual for the State-Trait Anxiety Inventory. Palo Alto CA: Consulting Psychologists Press, 1970.
4. Kay SR, Fiszbein A, Opler LA. The Positive and Negative Syndrome Scale (PANSS) for schizophrenia. Schizophr Bull 1987; 13: 261-276.
5. Herzberger SD, Chan E, Katz J. The development of an assertiveness self-report inventory. J Pers Assess 1984; 48: 317-323.
6. Lambourn J, Gill D. A controlled comparison of simulated and real ECT. Br J Psychiatry 1978; 133: 514-519.
7. Olmsted MP, Davis R, Rockert W, et al. Efficacy of a brief group psychoeducational intervention for bulimia nervosa. Behav Res Ther 1991; 29: 71-83.
8. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. Oxford: Oxford University Press, 1989.
9. Nunnally JC Jr. Introduction to psychological measurement. New York: McGraw-Hill, 1970.
10. Anastasi A. Psychological testing, fifth edition. New York: Macmillan, 1982.
11. Butcher JN, Dahlstrom WG, Graham JR, et al. Minnesota Multiphasic Personality Inventory (MMPI-2): manual for administration and scoring. Minneapolis MN: University of Minnesota Press, 1989.
12. Millon T. Manual for the MCMI-II. Minneapolis MN: National Computer Systems, 1987.
13. Nideffer RM. Test of Attentional and Interpersonal Style. J Pers Soc Psychol 1976; 34: 394-404.
14. Robins LN, Helzer JE, Croughan J, et al. NIMH Diagnostic Interview Schedule: Version III. Bethesda MD: NIMH, 1981.
15. Spitzer R, Endicott J. Schedule for Affective Disorders and Schizophrenia (SADS). New York: New York State Psychiatric Institute, 1982.
16. McDonald RK. Problems in biologic research in schizophrenia. J Chronic Dis 1958; 8: 366-371.
17. Ullmann LP, Giovannoni JM. The development of a self-report measure of the process-reactive continuum. J Nerv Ment Dis 1964; 138: 38-42.
18. Thoits PA. Conceptual, methodological, and theoretical problems in studying social support as a buffer against life stress. J Health Soc Behav 1982; 23: 145-158.
19. Wallston BS, Alagna SW, DeVellis BM, et al. Social support and physical health. Health Psychol 1983; 2: 367-391.
20. Hathaway SR, McKinley JC. Manual for the Minnesota Multiphasic Personality Inventory, revised. New York: Psychological Corporation, 1951.
21. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297-334.
22. Norman GR, Streiner DL. PDQ statistics. Toronto ON: BC Decker, 1986.
23. Beck AT, Steer RA. Beck Hopelessness Scale manual. New York: Psychological Corporation, 1988.
24. Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bull 1955; 52: 281-302.
25. Watson D, Friend R. Measurement of social-evaluative anxiety. J Consult Clin Psychol 1969; 33: 448-457.
26. Tan AL, Kendis RJ, Fine JT, et al. A short measure of Eriksonian ego identity. J Pers Assess 1977; 41: 279-284.
27. Radloff LS. The CES-D scale: a self-report depression scale for research in the general population. Applied Psychol Measurement 1977; 1: 385-401.
28. Mitchell JV. Mental measurements yearbook. Lincoln NE: Buros Institute of Mental Measurements, 1985.
29. Keyser DJ, Sweetland RC. Test critiques. Kansas City MO: Test Corporation of America, 1984.
30. Corcoran K, Fischer J. Measures for clinical practice: a sourcebook. New York: Free Press, 1987.
31. McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires. Oxford: Oxford University Press, 1987.

Résumé

Rating scales are of great utility for both clinical work and research. However, they vary greatly with respect to their reliability, validity, and utility. This article serves as a guide for people who must evaluate scales, whether to incorporate them into their own research or clinical activities, or to determine the usefulness of the results of studies that employ scales.
The different types of reliability and validity are reviewed, and guidelines are offered for evaluating them. Finally, other aspects of scales that can influence their utility are discussed, such as completion time, training, and ease of scoring.
