POTENTIAL BENEFITS AND LIMITATIONS OF RATING SCALES IN PSYCHIATRY Rating scales in psychiatry serve to standardize the information collected across time or observers. This standardization ensures a comprehensive evaluation that may aid treatment planning by establishing a diagnosis, ensuring a thorough description of symptoms, identifying comorbid conditions, and characterizing other factors affecting treatment response. In addition, it can establish a baseline for follow-up of the progress of illness over time or in response to specific interventions. This is particularly useful when several clinicians are involved, for instance in a group practice or clinic setting, or in the conduct of psychiatric research. In addition to standardization, most rating scales also offer the user the results of a formal evaluation of their performance characteristics. This means that the clinician can know to what extent a given scale produces reproducible results (reliability) and how it compares with more definitive or established ways of measuring the same thing (validity). Rating scales also offer some practical advantages. First, they can save valuable physician time: self-administered rating scales can be completed in the waiting room, or a nurse or technician can administer an interview prior to a session with the physician. In addition, rating scales may make it easier to obtain information about sensitive areas such as cognitive decline or sexual functioning, in which direct questioning is sometimes experienced as more intrusive. However, rating scales are not a panacea. They can provide erroneous measurements because of difficulties in administration or limitations in the underlying construct. In this respect they do not differ from clinical assessments, but they may appear to provide more definitive information and thus give a spurious sense of security. At the practical level, they take time that might better be devoted to other pursuits. 
The critical decision about using a formal assessment tool in clinical practice is whether on balance it contributes useful information in an efficient manner. This decision depends on the specific clinical setting and goal, the practical attributes of the scale, and its psychometric properties. TYPES OF SCALES AND WHAT THEY MEASURE Scales are used in psychiatric research and practice to achieve a variety of goals. They also cover a broad range of areas and use a broad range of procedures and formats. Measurement Goals Most psychiatric rating scales in common use fall into one or more of the following categories: making a diagnosis (e.g., the Structured Clinical Interview for DSM-IV [SCID] or the Diagnostic Interview Schedule for Children [DISC]); measuring severity and tracking change in specific disorders (e.g., the Hamilton Rating Scale for Anxiety [HAM-A] or the Mini-Mental State Examination [MMSE]) or general symptoms (e.g., the Symptom Checklist-90) or in overall outcome (e.g., the Behavior and Symptom Identification Scale [BASIS-32]); screening for conditions that may or may not be present (e.g., the CAGE or the Zung Self-Rating Depression Scale). Constructs Assessed Psychiatric practitioners and investigators assess a broad range of areas, referred to as constructs to underscore the fact that they are not simple, direct observations of nature. These include diagnoses, signs and symptoms, severity, functional
impairment, quality of life, and many others. Some of these constructs are fairly complex and are divided into two or more domains (e.g., positive and negative symptoms in schizophrenia, or mood and neurovegetative symptoms in major depression). Many scales yield separate scores, or subscales, for each domain. Especially when these domains are seen as substantially independent, they may be referred to as dimensions (e.g., Axis I and Axis II in the fourth edition of Diagnostic and Statistical Manual of Mental Disorders [DSM-IV], multidimensional personality traits). Categorical Versus Continuous Classification Some constructs are viewed as categorical, or classifying, while others are seen as continuous, or measuring. Categorical constructs describe the presence or absence of a given attribute (e.g., competency to stand trial) or the category best suited to a given individual among a finite set of options (e.g., assigning a diagnosis). Continuous measures provide a quantitative assessment along a continuum of intensity, frequency, or severity. In addition to symptom severity and functional status, multidimensional personality traits, cognitive status, social support, and many other attributes are generally measured continuously. The distinction between categorical and continuous measures is by no means absolute. Ordinal classification, which uses a finite, ordered set of categories (e.g., unaffected, mild, moderate, severe), stands between the two. In addition, a cutpoint is frequently used with a continuous or ordinal scale to indicate a threshold for membership in a corresponding category. For instance, individuals with Mini-Mental State Examination scores below 24 may be considered to have a dementia, or those with Hamilton Rating Scale for Depression (HAM-D) scores above 8 may be considered to have an episode of major depression. Measurement Procedures Rating scales differ in measurement methods. 
Issues to be considered include format, raters, and sources of information. Format Rating scales are available in a variety of formats. Some are simply checklists or guides to observation that help the clinician achieve a standardized rating. Others are self-administered questionnaires or tests. Still others are formal interviews that may be fully structured (i.e., specifying the exact wording of questions to be asked) or partly structured (i.e., providing only some precise wording, along with suggestions for additional questions or probes). Whether fully structured or not, instruments may be written so that all questions are always included, or they may have formal skip-out sections to limit administration time. Individual items also vary in their format. Most commonly, scales use yes-no or multiple choice questions. Often, answers are graded on a Likert scale, an ordinal scale with three to seven points that measures severity, intensity, frequency, or other attributes. Likert scales are most often partially or fully anchored, assigning a meaning to each numeric level. The same anchors can apply to all items or the instrument may provide specific anchors for each. Occasionally, questionnaires include open-ended questions, especially at the beginning, which may be used to help establish rapport. In semistructured or unstructured interviews, this information also serves to guide the rest of the interview and aids in forming a clinical impression about the patient. Raters Some instruments are designed to be administered by doctoral-level clinicians only, while others may be administered by individuals such as psychiatric nurses or social workers with more limited clinical experience. Still other instruments are designed
primarily for use by lay raters with little or no experience with psychopathology. In general, more training is required to administer less-structured scales. In addition, some scales require extensive training, even for experienced clinicians, to master the appropriate procedures and achieve a good result. Virtually all scales perform better when raters are familiar with their format and specific content. Source of Information Instruments also vary in the source of information used to make the ratings. Information may be obtained solely from patients, who generally know the most about their condition. In some instruments, some or all of the information may be obtained from a knowledgeable informant. When the construct involves limited insight (e.g., cognitive disorders or mania) or significant social undesirability (e.g., antisocial personality, substance abuse), other informants may be preferable. Informants may also be helpful when the subject has limited ability to recall or report symptoms (e.g., delirium, dementia, or any disorder in young children). Some rating scales also allow or require inclusion of information from medical records or patient observation. ASSESSMENT OF RATING SCALES In clinical research, rating scales are mandatory to ensure interpretable and potentially generalizable results and are selected on the basis of coverage of the relevant constructs, expense (based on the raters, purchase price, if any, and necessary training), length and administration time, comprehensibility to the intended audience, and quality of the ratings provided. In clinical practice, one considers these factors and also whether a scale would provide more or better information than would be obtained in ordinary clinical practice or contributes to the efficiency of obtaining that information. In either case, the assessment of quality is based on psychometric, or mind measuring, properties. 
Psychometric Properties The two principal psychometric properties of a measure are reliability and validity. Although these words are used almost interchangeably in everyday speech, in the context of evaluating rating scales they are distinct. To be useful, scales should be reliable, or consistent and repeatable even if performed by different raters, at different times, or under different conditions, and they should be valid, or accurate in representing the true state of nature. Relation Between Reliability and Validity Establishing a measure's reliability is generally considered primary, since it is difficult to reach valid judgments without first achieving consistency. However, problems with reliability can be overcome to an extent by combining information from several assessments. Unfortunately, improved reliability does not guarantee improved validity, and some efforts to improve reliability may actually limit validity. For example, a personality disorder instrument might focus on overt behaviors rather than inner thoughts and feelings to achieve higher reliability but at the cost of losing some of the most valid information about personality. Even with clinically trained raters, it is particularly difficult to achieve reliability on items requiring subjective clinical judgment (e.g., feelings evoked in the examiner): nonetheless, when used by experienced diagnosticians, such items may contribute substantially to valid diagnoses. Reliability Reliability refers to the consistency or repeatability of ratings and is largely empirical. In the categorical context, it refers to whether agreement can be reached on the classification of each individual. In the continuous context, it refers to whether agreement can be reached on the assignment of a given score. It can also be seen as precision; that
is, whether a measure yields a ballpark estimate or a finely graded score. An instrument is more likely to be reliable if the instructions and questions are clearly and simply worded and the format is easy to understand and score. There are three standard ways to assess reliability: internal consistency, interrater, and test-retest. INTERNAL CONSISTENCY Internal consistency assesses agreement among the individual items in a measure. This provides information about reliability because each item is viewed as a single measurement of the underlying construct; thus, the coherence of the items suggests that each of them is measuring the same thing (and hence all of them are). Internal consistency is measured most often with coefficient alpha (also known as Cronbach's alpha), which ranges between 0 and 1; values of .75 or above are considered good. However, the internal consistency of a measure depends on the internal consistency of the construct that the measure purports to assess and is higher for unidimensional constructs than those with two or more relatively independent domains. INTERRATER AND TEST-RETEST RELIABILITY Interrater (also called interjudge, or joint) reliability is a measure of agreement between two or more observers using the same information to evaluate the same subjects. Estimates may vary with assessment conditions; for instance, estimates of interrater reliability based on videotaped interviews tend to be higher than those based on interviews conducted by one of the raters. Interrater reliability tends to be higher than test-retest reliability, a measure of agreement between evaluations at two points in time, in which the information obtained may differ (e.g., be associated with differences in interviewer skill, interviewer mood, room conditions, or subject's attitude). 
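Returning to internal consistency, coefficient alpha can be computed directly from a table of item scores. The following is a minimal sketch in Python rather than a reference implementation: it assumes complete data (no missing items), uses the standard formula based on item variances and the variance of the total score, and the function name and the sample ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of respondents, each a list of k item scores.

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
    Assumes every respondent answered every item (an assumption of this sketch).
    """
    if len(scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent must have the same number of items")

    # Transpose so that each tuple holds one item's scores across respondents.
    items = list(zip(*scores))
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical ratings: five respondents, three items scored 1 to 5.
example_scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
    [5, 4, 5],
]
print(round(cronbach_alpha(example_scores), 2))
```

Because alpha rises when items move together, a scale that deliberately covers two or more relatively independent domains would be expected to show a lower value than a unidimensional scale of the same length.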
In addition, test-retest evaluations measure reliability only to the extent that the subject's true condition remains stable in the time interval, which is problematic for many conditions but virtually impossible for rapidly fluctuating conditions like state anxiety. However, because the test-retest situation more closely reflects the clinical problems associated with serial evaluations by multiple clinicians, to the extent that concerns about interval change can be eliminated, it is generally a more useful indicator of reliability in practice. Interrater reliability and test-retest reliability of continuous constructs are measured with the intraclass correlation coefficient (ICC), while those of categorical constructs are measured with the kappa coefficient (κ). A weighted version of κ is available to penalize large disagreements more than small ones (e.g., between schizophrenia and psychotic depression compared with schizophrenia and schizoaffective disorder). Both κ and the ICC are measures of agreement corrected for the agreement expected by chance alone, and both run from 0 (agreement no better than chance) to 1 (perfect agreement), although negative values can occur when agreement is worse than chance. As a rule of thumb, a κ or ICC above .8 is considered excellent, those in the .7 to .8 range are considered good, and those in the .5 to .7 range are considered fair. However, the degree of reliability required varies with the clinical purpose; extremely reliable ratings are required before administering potentially dangerous treatments, while more modest reliability may suffice for estimating rates in a population. ISSUES IN INTERPRETING RELIABILITY DATA When interpreting reliability data, remember that reliability estimates published in the literature may not generalize to other settings. Factors to consider are the nature of the sample, the training and experience of
the raters, and the test conditions. Issues regarding the sample are especially critical. In particular, reliability tends to be higher in samples with high variability, in which it is easier to discriminate among individuals. Thus, for continuous measures, reliability tends to be higher when the sample includes individuals with a wide range of scores. For categorical measures, reliability tends to be higher when the prevalence of the attribute being measured is fairly high. Reliability estimates also depend on the fraction of difficult cases (e.g., individuals near a diagnostic threshold or those resistant to being interviewed), since large numbers of these tend to diminish observed reliability. Validity Validity refers to conformity with truth or a gold standard that can stand for truth. In the categorical context, it refers to whether an instrument can make correct classifications. In the continuous context, it refers to accuracy, or whether the score assigned represents the true state of nature. While reliability is empirical, validity is partly theoretical; many constructs measured in psychiatry have no absolute truth. Even so, some measures yield more useful and meaningful data than others. Validity assessment is generally divided into face and content validity, criterion validity, and construct validity. FACE AND CONTENT VALIDITY Face validity refers to whether the items appear to assess the construct in question. Although a rating scale may purport to measure a construct of interest, a review of the items may reveal that it embodies a very different conceptualization of the construct. For instance, an “insight” scale may define insight in either psychoanalytic or neurological terms. However, items with a transparent relation to the construct may be a disadvantage when measuring socially undesirable traits such as substance abuse or malingering. 
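As an aside on the reliability point made above, that categorical reliability depends on the prevalence of the attribute, the effect can be seen by computing the kappa coefficient for two samples with the same raw agreement. This is a minimal sketch in Python of the unweighted two-rater kappa; the function name and both agreement tables are hypothetical and are not drawn from any study.

```python
def cohen_kappa(table):
    """Unweighted kappa for two raters from a square agreement table.

    table[i][j] is the number of subjects that rater A placed in category i
    and rater B placed in category j.
    """
    size = len(table)
    if any(len(row) != size for row in table):
        raise ValueError("agreement table must be square")
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("agreement table is empty")

    # Observed agreement: proportion of subjects on the diagonal.
    observed = sum(table[i][i] for i in range(size)) / n
    # Chance agreement: product of the two raters' marginal proportions.
    row_props = [sum(table[i]) / n for i in range(size)]
    col_props = [sum(table[i][j] for i in range(size)) / n for j in range(size)]
    expected = sum(r * c for r, c in zip(row_props, col_props))
    if expected == 1:
        raise ValueError("both raters used a single category; kappa is undefined")
    return (observed - expected) / (1 - expected)


# Hypothetical samples of 100 subjects, rows = rater A, columns = rater B,
# categories ordered as (case, noncase). Raw agreement is 80% in both.
balanced_sample = [[40, 10], [10, 40]]   # about half the sample are cases
rare_sample = [[5, 10], [10, 75]]        # about 15% of the sample are cases

print(round(cohen_kappa(balanced_sample), 2))
print(round(cohen_kappa(rare_sample), 2))
```

With these made-up counts the raters agree on 80 of 100 subjects in each sample, yet kappa is markedly lower in the sample where the attribute is uncommon, because more of the observed agreement is attributable to chance.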
Content validity is similar to face validity but describes whether the measure provides good balanced coverage of the construct and is less focused on whether the items give the appearance of validity. Content validity is often assessed with formal procedures such as expert consensus or factor analysis. CRITERION VALIDITY Criterion validity (sometimes called predictive or concurrent validity) refers to whether or not the measure agrees with a gold standard or criterion of accuracy. Suitable gold standards include the long form of an established instrument for a new shorter version, a clinician-rated measure for a self-report form, and blood or urine tests for measures of drug use. For diagnostic interviews, the generally accepted gold standard is the longitudinal, expert, all data (LEAD) standard, which incorporates expert clinical evaluation, longitudinal data, medical records, family history, and any other sources of information. When comparing continuous measures with a gold standard, a correlation coefficient is the statistic most often reported. For categorical variables such as diagnoses (or continuous measures with a cutpoint), sensitivity and specificity are the statistics of choice. Sensitivity refers to the test's ability to identify true cases, or its true positive rate. Specificity is the test's accuracy in identifying noncases, or one minus the false positive rate. In general, the more sensitive a test, the less specific it is. If the threshold for diagnosis, for example, is lowered, more cases are detected but at the expense of some false positives; if the threshold is raised to decrease the number of false positives, true cases are inevitably missed. The optimal threshold depends on the consequences of false positives and false negatives. CONSTRUCT VALIDITY When an adequate gold standard is not available—a frequent state of affairs in psychiatry—or whenever additional validity data are desired, construct
validity must be assessed. To accomplish this, one can compare the measure with external validators, attributes that bear a well-characterized relation to the construct under study but are not measured directly by the instrument. External validators used to validate psychiatric diagnostic criteria and the diagnostic instruments that aim to operationalize them include course of illness, family history, and treatment response. For example, when compared with schizophrenia measures, mania measures are expected to identify more individuals with a remitting course, a family history of major mood disorders, and a good response to lithium. Two special cases of assessing validity using external validators have particular relevance for clinical psychiatry. One is discriminant validity, which examines a measure's ability to discriminate between populations that are expected to differ on the construct of interest. For example, does a sociopathy measure correctly separate individuals in jails from those living in the community? Although such discriminations are important in clinical practice, the true test of a measure is its ability to discriminate at the margins. A study of discriminant validity is more clearly relevant if it includes the types of cases encountered in clinical practice (e.g., psychotic depression versus schizoaffective disorder) rather than more easily discriminated populations (e.g., psychotic depression versus normal). Another special case is sensitivity to change; the fact that a measure shows expected changes (e.g., an improvement with an efficacious treatment or a decrement with a progressive disease) can be a strong validator. When validity is assessed in areas with few established measures, where no gold standard or criterion of accuracy can be established, the assessment of the validity of the measure is limited by the validity of the construct itself. 
Nonetheless, by triangulating between a better definition of the construct, better ways to measure it, and better exploration of how it operates in clinical practice and research, the field moves to greater validity over time. SELECTION OF PSYCHIATRIC RATING SCALES The rating scales used in psychiatric practice and research presented below are grouped by topic, beginning with such general issues as diagnosis, functioning, symptom severity, and side effects and then proceeding to specific diagnostic groups, organized according to the section of DSM-IV. The selection was made on the basis of coverage of major areas and common use in clinical research, current (or potential) use in clinical practice, or both. A brief discussion of measurement issues for each area is followed by a description of each instrument, its psychometric properties, and its potential uses. Whenever possible, a brief, clinically useful instrument is provided in each area. References for each measure, organized by topic, are listed in Table 7.8-1. These references include more detailed information about each measure and its psychometric properties and may also provide either the measure itself or instructions for obtaining a copy. Functional Status, Impairment, and General Symptom Severity The broad area of functional status, impairment, and general symptom severity cuts across a variety of diagnoses and is thus useful for grading patients by functional status or overall severity without reference to specific symptomatology. The instruments presented here have a strong mental health focus and often include items on psychiatric symptomatology. Instruments focused on more-global functioning or domains such as mobility and self-care are not generally included here.


Global Assessment of Functioning (GAF) Scale and Social and Occupational Functioning Scale (SOFAS) The GAF was developed in the early 1990s to rate Axis V of DSM-IV and provides a measure of overall functioning related to psychiatric symptoms. The GAF is extremely similar to the Global Assessment Scale (GAS) used for the same purpose in the third edition of DSM (DSM-III) and the revised third edition (DSM-III-R), from which it was derived. A related instrument is the SOFAS, proposed as a new axis in Appendix B of DSM-IV, which focuses only on functioning and not on symptoms and does not try to discriminate between functional changes related to psychiatric and nonpsychiatric causes. Both scales are clinician rated on a 100-point scale based on all available information, with clear descriptions of each 10-point interval. Ratings are generally made for the past week, but longer intervals (e.g., highest during the past year) can be used. Instructions for rating the GAF and SOFAS are included in DSM-IV; clinician raters do not require additional training to use these scales. The GAS has received more extensive evaluation and shows fair-to-good reliability and good validity judged against clinician ratings of the degree of impairment. GAF or GAS ratings are often required for billing purposes. In addition, the scales have been used to track change with treatment in inpatient and outpatient practice and in multiple research studies. The major criticism of the GAS and GAF is that they tend to confound symptoms and functioning, so that individuals with significant symptomatology (e.g., fixed delusional system) score low even when their social and occupational functioning is relatively spared. Global Assessment of Relational Functioning The GARF was developed in the late 1980s to provide a measure of the quality of functioning in relationships analogous to the measure of individual functioning provided by the GAF or SOFAS. 
It was subsequently included in Appendix B of DSM-IV as an additional axis for further consideration. It provides a global rating on a 100-point scale based on a review of three major areas: problem-solving, organization, and emotional climate. Anchors are provided for each quintile of each domain. The GARF is focused on the particular needs of family and couple therapists but can be rated by any clinician. Ratings are generally based on the present, but alternate periods (e.g., the past year, or in the period following a major stressor) can be implemented. The GARF has not been extensively evaluated, but preliminary evidence suggests that clinician and even nonclinician raters can achieve good to excellent reliability with only minimal training. The validity of the GARF is supported by expected correlations with other measures of family and couple distress and functioning. The GARF shows promise for rating relational functioning but only time will tell whether it will prove useful in clinical or research practice. Behavior and Symptom Identification Scale The BASIS-32 was developed in the early 1990s to provide a broad but brief overview of psychiatric symptoms and functional status from the patient's point of view for use in assessing the outcome of psychiatric treatment. The instrument assesses a wide range of areas, including family and work relationships; ability to complete regular tasks at home, work, or school; and symptoms of anxiety, depression, psychosis, and substance abuse. Each item is rated on a five-point scale focused on the degree of difficulty during the preceding week. BASIS-32 can be completed as a paper and pencil test (requiring 5 to 20 minutes) or the questions can be
read aloud with the patient selecting the best answer from a laminated card (requiring 15 to 20 minutes). It can be scored readily by hand. A computerized scoring system is also available. BASIS-32 generates an overall score and five subscales: relation to self and others, daily living and role functioning, depression and anxiety, impulsive and addictive behavior, and psychosis. Good reliability and validity have been demonstrated. Its simple administration, brevity, and broad coverage make it well suited to its original task, and it is frequently used at baseline, during, and after treatment to monitor progress. It can provide valid ratings across a wide range of psychiatric impairment but is not generally suitable for individuals with substantial cognitive impairment. It is also not suitable for children under age 14. Symptom Checklist-90 Revised (SCL-90-R) and Brief Symptom Inventory (BSI) The SCL-90-R was developed in the mid-1970s from the older Hopkins Symptom Checklist, a multidimensional measure of the severity of psychopathology. (SCL-90 refers to a very similar earlier version but is frequently used to denote the current version as well.) The BSI was developed in the early 1980s as a short form of the SCL-90-R. Both cover the following domains: depression, anxiety, phobia, psychoticism, paranoia, obsessive-compulsive, hostility, somatization, and interpersonal sensitivity. Even the longer SCL-90-R fits on two sides of a single sheet of paper and can be completed in 20 minutes or less. The SCL-90-R is a self-report measure with 90 items Likert-scaled from 0 to 4 on the basis of the distress caused over the past week. The BSI is very similar, but has 53 items. Hand scoring is relatively simple, and computerized scoring is available. Both yield t scores based on extensive normative data for each subscale, a Global Severity Index, a Positive Symptom Distress Index, and a Positive Symptom Total. 
Reliability is fair to good, depending on the subscale, and most subscales appear reasonably valid when assessed against more specific measures (e.g., the depression scale against a HAM-D). The principal use of the SCL-90-R and BSI has been in characterizing psychopathology in treatment and other studies, but they have sometimes been used in primary care clinics as a screening tool for psychopathology. However, caution may be warranted in this setting because their sensitivity may be limited in some areas. In addition, the subscales bear only a modest relation to the corresponding DSM-IV disorders. Side Effects The instruments described below are used to detect and quantify side effects from psychiatric medications, specifically motor effects of antipsychotics. Because of the focus on motor symptoms, these scales include a brief, focused physical examination as well as questions posed to the patient. Abnormal Involuntary Movement Scale (AIMS) The AIMS is a clinical examination and rating scale that was developed in the 1970s to measure dyskinetic symptoms in patients taking antipsychotic drugs. The AIMS has 12 items, each of which is rated on an item-specific five-point severity scale ranging from 0 to 4. Total scores are not generally reported. Instead, changes in global severity and individual areas can be monitored over time. Ten items cover the movements themselves, divided into sections rating global severity and those related to specific body regions; two items concern dental factors that can complicate the diagnosis of dyskinesia. In the presence of extended neuroleptic exposure and the absence of other conditions causing dyskinesia, mild dyskinetic movements in two areas or moderate movements in one area suggest a diagnosis of
tardive dyskinesia. The AIMS was developed for clinician raters, but lay raters can be trained to use it. It can be completed in under 10 minutes. Excellent reliability has been demonstrated, especially for experienced raters, and the instrument appears valid. In many clinical settings, the AIMS is considered standard clinical practice for patients receiving long-term neuroleptic drugs and is useful in clinical practice and research, both for monitoring patients for the development of tardive dyskinesia and for tracking changes in tardive dyskinesia over time. Simpson-Angus Rating Scale for Extrapyramidal Side Effects The Simpson-Angus scale was developed to monitor the effects of antipsychotic drugs. It has 10 items, each of which is rated on an item-specific, five-point severity scale ranging from 0 to 4. Scores are reported as the mean on all 10 items, with 0.3 considered the upper limit of normal. It is strongly focused on parkinsonian symptoms, particularly rigidity, but includes one akathisia item. It is designed for clinician use but can be administered by trained lay raters and takes about 10 minutes to administer. Good reliability has been reported, and validity is supported by the correlation of scores with antipsychotic drug dose. The scale is useful in a wide variety of clinical settings to monitor parkinsonian adverse effects and the impact of interventions to treat these effects. Psychiatric Diagnosis Instruments assessing psychiatric diagnosis are central to psychiatric research and may have utility in clinical practice as well. However, they tend to be rather long, especially for individuals reporting many symptoms, who may require many follow-up questions. When evaluating such instruments, one must be sure that they implement current diagnostic criteria and cover the diagnostic areas of interest. 
For instance, few cover personality disorders (with the exception in some cases of antisocial personality), and not all cover disorders that typically begin in childhood. Structured Clinical Interview for DSM-IV The SCID was developed in the early 1990s to provide a standardized DSM-III-R Axis I diagnosis based on an efficient but thorough clinical evaluation. It has since been updated for DSM-IV. The semistructured diagnostic interview begins with a section on demographic information and clinical background. Then there are seven diagnostic modules focused on different diagnostic groups: mood, psychotic, substance abuse, anxiety, somatoform, eating, and adjustment disorders. Both required and optional probes are provided, and skip-outs are suggested when no further questioning is warranted. All available information, including that from hospital records, informants, and patient observation, should be used to rate the SCID. The SCID is designed to be administered by experienced clinicians and is generally not recommended for use by lay interviewers. In addition, formal training in the SCID is required, and training books and videos are available to facilitate this. In individuals without symptoms, the interview takes approximately 1 hour, but it may take up to 3 hours in individuals with extensive symptomatology. Although the primary focus is research with psychiatric patients, a nonpatient version (with no reference to a chief complaint) and a more clinical version (without as much detailed subtyping) are also available. Reliability data on the SCID suggest that it performs better on more-severe disorders (e.g., bipolar I disorder, alcohol dependence) than on milder ones (e.g., dysthymia). Validity data are limited, as the SCID is more often used as the gold standard to judge other instruments. It is considered the standard interview to verify diagnosis in


clinical trials and is extensively used in other forms of psychiatric research. It can also be used to ensure a systematic evaluation in psychiatric patients; for instance, on admission to an inpatient unit or at intake into an outpatient clinic. It is also used in forensic practice to ensure a formal and reproducible examination. Diagnostic Interview Schedule (DIS) and Composite International Diagnostic Interview (CIDI) The DIS and CIDI are fully structured diagnostic interviews designed for lay administration. The DIS was developed in the 1980s for use in the Epidemiologic Catchment Area (ECA) study in the United States, which aimed to assess rates of current and lifetime psychiatric illness according to DSM-III in a large and diverse community sample, and has since been updated for DSM-III-R and DSM-IV criteria. In 19 diagnostic modules, it covers a broad range of Axis I conditions in adults, plus several childhood disorders and antisocial personality. The most recent version also includes more information about symptoms, impairment, and treatment. The CIDI was developed from the DIS for international use and covers both ICD and DSM criteria in 11 diagnostic modules; the CIDI does not cover antisocial personality or childhood disorders. The instruments are fairly similar, and both involve verbatim reading of questions with little or no rewording allowed; only specified probes may be used for follow-up. The DIS takes 90 minutes to 2 hours; the CIDI may be somewhat shorter. Both can be scored by computer, yielding diagnoses and symptom profiles. A computerized, self-administered version of the CIDI is also available. The instruments are designed for use by lay interviewers with extensive training in their use: formal training is recommended. Reliability appears to be good for both, at least for more-severe disorders. 
Validity appears problematic for the DIS; studies of agreement with clinician diagnoses have yielded inconsistent results, with marked discrepancies often observed for psychotic disorders. The validity of the CIDI is still being evaluated. Both instruments have been used extensively in psychiatric research, particularly in epidemiological settings, and provide valuable data. However, some caution is warranted in interpreting these data, given concerns about the instruments' validity. Primary Care Evaluation of Mental Disorders (PRIME-MD) The PRIME-MD was developed in the mid-1990s to provide an efficient screening and evaluation tool for common mental disorders seen in the primary care setting. The instrument has two parts: a 25-item patient questionnaire that screens for a range of symptoms and a structured interview designed to follow up on any symptoms identified in the patient questionnaire or through other means. The interview has five modules covering mood, anxiety, alcohol, somatoform, and eating disorders. The patient questionnaire is very brief and can be completed in under 5 minutes. If follow-up is required, a primary care practitioner can complete the structured interview in about 10 minutes. Training in the use of the instrument or careful review of the instruction manual is recommended. A self-report version of the structured interview is in development. Reliability appears to be fair to good, better for more severe diagnoses. Validity judged against psychiatrist evaluations is quite good. The screening instrument has appropriately high sensitivity, and the follow-up interview provides reasonable specificity. The PRIME-MD appears to be useful for primary care settings and may also be useful in psychiatric practice when a quick screen is desired. However, its utility for the latter purpose is limited by its lack of coverage of


the more-severe psychopathology not typically seen in primary care (e.g., psychotic symptoms, mania). Psychotic Disorders A variety of instruments are used for patients with psychotic disorders. Those reported here are symptom severity measures. A developing consensus suggests that distinguishing positive and negative symptoms in schizophrenia is worthwhile, and more-recently developed instruments implement this distinction. Because patients with psychotic disorders often lack insight and are sometimes agitated, patient observation is required in addition to direct questioning. Thus, most instruments in this domain must be administered by psychiatrists or others with clinical training. Brief Psychiatric Rating Scale (BPRS) The BPRS was developed in the late 1960s as a short scale for measuring the severity of psychiatric symptomatology. Developed primarily to assess change in psychotic inpatients, it covers a broad range of areas including thought disturbance, emotional withdrawal and retardation, anxiety and depression, and hostility and suspiciousness. Its 18 items are rated on a seven-point item-specific Likert scale from 0 to 6, with the total score ranging from 0 to 108 (in some scoring systems, the lowest level for each item is 1, and the range is 18 to 126). Because the ratings include observations as well as patient reports of symptoms, the BPRS can be used to rate patients with very severe impairment. It is intended for use by experienced clinicians and can be administered in 30 minutes or less, including patient interview and observation. Reliability of the BPRS is good to excellent when raters are experienced but is more difficult to achieve without substantial training; anchored versions and a semistructured interview have been developed to increase reliability. Validity is also good as measured by correlations with other measures of symptom severity, especially those assessing schizophrenia symptomatology. 
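The two BPRS scoring conventions described above differ only by a constant offset of one point per item, or 18 points on the total. The following is a minimal Python sketch of that arithmetic, assuming 18 integer item ratings; the function names `bprs_total` and `to_zero_based` and the `base` parameter are illustrative and are not part of the published instrument.

```python
def bprs_total(ratings, base=0):
    """Sum 18 BPRS item ratings.

    base=0 assumes items rated 0-6 (total 0-108);
    base=1 assumes items rated 1-7 (total 18-126).
    """
    if base not in (0, 1):
        raise ValueError("base must be 0 or 1")
    if len(ratings) != 18:
        raise ValueError("the BPRS has 18 items")
    low, high = base, base + 6
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside {low}-{high}")
    return sum(ratings)


def to_zero_based(total_one_based):
    """Convert a total from the 1-7 convention to the 0-6 convention."""
    return total_one_based - 18
```

When totals from different studies are compared, it helps to confirm which convention each study used before interpreting a difference of this size.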
The principal use of the BPRS is as an outcome measure in treatment studies of schizophrenia, and it functions well as a measure of change in this context. However, it has been largely supplanted in more recent clinical trials by the newer measures described below. In addition, given its focus on psychosis and associated symptoms, it is only suitable for patients with fairly significant impairment. Its use in clinical practice is less well supported, in part because considerable training is required to achieve the necessary reliability. Positive and Negative Syndrome Scale (PANSS) The PANSS was developed in the late 1980s to remedy perceived deficits in the BPRS in the assessment of positive and negative symptoms of schizophrenia and other psychotic disorders by adding items and providing careful anchors for each. The PANSS includes 30 items on three subscales: 7 items covering positive symptoms (e.g., hallucinations and delusions), 7 covering negative symptoms (e.g., blunted affect), and 16 covering general psychopathology (e.g., guilt, uncooperativeness). Each item is scored on a seven-point item-specific Likert scale ranging from 1 to 7; thus the positive and negative subscales each range from 7 to 49, and the general psychopathology scale from 16 to 112. The PANSS requires a clinician rater because considerable probing and clinical judgment are required. A semistructured interview guide is available. The ratings can be completed in 30 to 40 minutes. Reliability for each scale is fairly high, with excellent internal consistency and interrater reliability. Validity also appears good based on correlation with other symptom severity measures and factor analytic validation of the subscales. The PANSS has become the standard tool for assessing clinical outcome in treatment studies of schizophrenia and other psychotic


disorders and is sensitive to change with treatment. Its high reliability and good coverage of both positive and negative symptoms make it excellent for this purpose. It may also be useful for tracking severity in clinical practice, and its clear anchors make it easy to use in this setting. Scale for the Assessment of Positive Symptoms (SAPS) and Scale for the Assessment of Negative Symptoms (SANS) The SAPS and SANS were designed to provide a detailed assessment of positive and negative symptoms of schizophrenia and may be used separately or in tandem. The domains assessed include hallucinations, delusions, bizarre behavior, and thought disorder for the SAPS and affective flattening, poverty of speech, apathy, anhedonia, and inattentiveness for the SANS. Each instrument consists of 30 fully anchored items each scored 0 to 5; thus the total score ranges from 0 to 150 for each. Each must be rated by an experienced clinician and requires approximately 30 minutes to complete. Good-to-excellent interrater reliability exists if trained interviewers are used, and each scale has high internal consistency as well. Validity is supported by correlation with other symptom severity instruments. The SAPS and SANS are principally used to monitor treatment effects in clinical research and have also been used to help characterize positive and negative symptoms in studies of schizophrenia phenomenology. The comprehensive characterization of symptomatology provided might also be useful in clinical practice, but there is little experience in this area at the present time. Mood Disorders The domain of mood disorders includes both depressive and bipolar disorders. The issues for mania are similar to those for psychotic disorders, in that limited insight and agitation may hinder accurate symptom reporting, so clinician ratings including observational data are generally required. 
Rating depression, on the other hand, depends substantially on subjective assessment of mood states, so interviews and self-report instruments are both common. Because depression is common in the general population and involves significant morbidity and even mortality, screening instruments, especially those using a self-report format, are potentially quite useful in primary care and community settings. Hamilton Rating Scale for Depression (HAM-D) The HAM-D was developed in the early 1960s to monitor the severity of major depression, with a focus on somatic symptomatology. The version in most common use has 17 items, although versions with different numbers of items, including the 24-item version (Table 7.8-8), have been used in many studies as well. Most versions do not include some of the symptoms used to diagnose depression in DSM-III and its successors, most notably increased sleep and increased appetite. Items on the HAM-D are scored 0 to 2 or 0 to 4, with total scores on the 17-item version ranging from 0 to 50: scores of 7 or less may be considered normal; 8 to 13, mild; 14 to 18, moderate; 19 to 22, severe; and 23 and above, very severe. The HAM-D was designed for clinician raters but has been used by trained lay administrators as well. Ratings are completed by the examiner on the basis of patient interview and observations. A structured interview guide has been developed to improve reliability. The ratings can be completed in 15 to 20 minutes. Reliability is good to excellent, including internal consistency and interrater assessments. Validity appears good based on correlation with other depression symptom measures. The HAM-D has been used extensively to evaluate change in response to pharmacological and other interventions. It


is more problematic in elderly and medically ill persons, in whom somatic symptoms may not indicate major depression. Beck Depression Inventory (BDI) The BDI was developed in the early 1960s to rate depression severity, with a focus on behavioral and cognitive dimensions of depression. The current version, the Beck Depression Inventory–II (Beck-II), has added more coverage of somatic symptoms to be compatible with DSM-IV and covers the most recent 2 weeks. Earlier versions focus on the past week or even shorter intervals, which may be preferable for monitoring treatment response. The BDI includes 21 self-report items, each of which has four statements describing increasing levels of severity; the total score ranges from 0 to 63. Scores of 0 to 9 are considered minimal; 10 to 16, mild; 17 to 29, moderate; and 30 to 63, severe. The scale can be completed in 5 to 10 minutes. Internal consistency has been high in numerous studies. Test-retest reliability is not consistently high, but this may reflect changes in underlying symptoms. Validity is supported by correlation with other depression measures. The principal use of the BDI is as an outcome measure in clinical trials of interventions for major depression, including psychotherapeutic interventions. Because it is a self-report instrument, it is sometimes used to screen for major depression, for instance in medical outpatients. Various cutoffs have been suggested for a diagnosis of major depression, but even a cutoff of 9 has only fair sensitivity, at a cost of considerable nonspecificity, suggesting that the instrument has limited use for screening. The instrument's strength lies in measuring the depth of depression; it is not suitable for making a diagnosis. Zung Self-Rating Depression Scale The Zung scale was developed in the 1960s to provide a self-report measure of major depression with broad coverage of depression symptomatology. 
It has 20 items, each of which is scored from 1 to 4, based on the fraction of time in which it occurs. Half of the items are scored positively, and half negatively; positive items must be reversed to obtain a total severity score, which is then converted by formula to a scaled score. Scaled scores under 50 are considered normal; 50 to 59, minimal depression; 60 to 69, moderate depression; and 70 and over, severe depression. Most individuals can complete the Zung scale in 5 to 10 minutes. The Zung scale has been in use for many years but has not been extensively evaluated. However, reliability is good based on split-half and internal consistency studies. Validity also appears good based on correlations with other depression measures and the ability to discriminate between depressed and nondepressed outpatients. The Zung scale has been used to follow depressed patients in treatment studies; however, there is less variation in Zung scale scores than in some other measures, which limits its utility as a change measure. It has also been used to screen for depression in medical outpatients or community interventions, including National Depression Screening Day, in which a threshold score of 50 was used to identify potential cases of depression requiring follow-up by a clinician. Young Mania Rating Scale (YMRS) The YMRS is a checklist developed in the late 1970s to provide a brief but thorough evaluation of the severity of mania that could be used to monitor treatment response or detect relapse. It consists of 11 items rated either 0 to 4 (seven items) or 0 to 8 (four items). Each item has five item-specific anchors. The total score ranges from 0 to 60. Ratings include clinical observation, so it must be rated by a clinician, but reliable ratings have been obtained by nurses on inpatient units and beginning psychiatric residents. Reliability is good, based on interrater


reliability and internal consistency studies. Validity also appears good, based on correlation with other mania measures. The YMRS is useful for evaluating response to treatment in clinical research, and it is sensitive to change in this setting. It might also be used to assess treatment response or monitor for relapse in treated or untreated patients, although extensive experience with this use has not been reported. Anxiety Disorders The anxiety disorders addressed by the measures below include panic disorder, generalized anxiety disorder, and obsessive-compulsive disorder. When examining anxiety measures, one must be aware that their definitions have changed significantly over time. Both panic and obsessive-compulsive disorder are relatively recently recognized, and the conceptualization of generalized anxiety disorder has shifted over time. Thus, older measures have somewhat less relevance for diagnostic purposes, although they may identify symptoms causing considerable distress. Whether reported during an interview or on a self-report rating scale, virtually all measures in this domain, like the measures of depression discussed above, depend on subjective descriptions of inner states. Hamilton Rating Scale for Anxiety The HAM-A was developed in the late 1950s to assess anxiety symptoms, both somatic and cognitive. Because conceptualization of anxiety has changed considerably, the HAM-A provides limited coverage of “worry” required for DSM-IV diagnosis of generalized anxiety disorder and does not include the episodic anxiety found in panic disorder. There are 14 items, each of which is rated 0 to 4 on an unanchored severity scale, with the total score ranging from 0 to 56. A score of 14 has been suggested as the threshold for clinically significant anxiety, but scores of 5 or less are typical in individuals in the community. 
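The HAM-A total described above is a plain sum of 14 item ratings. A minimal Python sketch of the arithmetic follows; the function names are illustrative, and treating a total equal to 14 as meeting the suggested threshold is an assumption, since the text does not say whether the threshold is inclusive.

```python
def ham_a_total(ratings):
    """Sum 14 HAM-A item ratings, each 0-4, giving a total of 0-56."""
    if len(ratings) != 14:
        raise ValueError("the HAM-A has 14 items")
    for r in ratings:
        if not 0 <= r <= 4:
            raise ValueError(f"rating {r} is outside 0-4")
    return sum(ratings)


def meets_suggested_threshold(total, threshold=14):
    """True if the total reaches the suggested threshold of 14.

    Assumption: the threshold is inclusive (total >= 14).
    """
    return total >= threshold
```

Because the items are unanchored, the same total from two untrained raters may not reflect the same clinical picture, so a flag of this kind is best read alongside the rater's training.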
The scale is designed to be administered by a clinician, and formal training or the use of a structured interview guide is required to achieve high reliability. A computer-administered version is also available. Reliability is fairly good, based on internal consistency, interrater, and test-retest studies. However, given the lack of specific anchors, reliability should not be assumed to be high across different users in the absence of formal training. Validity appears good, based on correlation with other anxiety scales, but is limited by the coverage of domains critical to the modern understanding of anxiety disorders. Even so, the HAM-A has been used extensively to monitor treatment response in studies of generalized anxiety disorder and may also be useful for this purpose in clinical settings. Panic Disorder Severity Scale (PDSS) The PDSS is a recently developed, brief rating scale aimed at measuring the severity of panic disorder. It was based on the Yale-Brown Obsessive Compulsive Scale (YBOCS) and has seven items, each of which is rated on an item-specific five-point Likert scale. The seven items address frequency of attacks, distress associated with attacks, anticipatory anxiety, phobic avoidance, and impairment. The items are scored 0 to 4, and the total score ranges from 0 to 28. The instrument was designed for use by clinicians, but a patient-scored computerized version is in development. Reliability is excellent, based on interrater studies, but, in keeping with the small number of items and multiple dimensions, internal consistency is limited. Validity is supported by correlations with other anxiety measures, both at the total and item level, and lack of correlation with the HAM-D. Because the PDSS has been available for a fairly short period of time, there is limited experience with its use. However, it appears to be sensitive to change with treatment and is thus likely to prove useful as a change


measure in clinical trials or other outcome studies for panic disorder, as well as for monitoring panic disorder in clinical practice. Yale-Brown Obsessive Compulsive Scale (YBOCS) The YBOCS was developed in the late 1980s to measure the severity of symptoms in obsessive-compulsive disorder. It has 10 items rated on the basis of a semistructured interview. The first five items concern obsessions: the amount of time they consume, the degree to which they interfere with normal functioning, the distress they cause, the patient's attempts to resist them, and the patient's ability to control them. The remaining five items ask parallel questions about compulsions. Each item has a set of item-specific anchors scored 0 to 4, so total scores for obsessions and compulsions each range from 0 to 20, and overall total score ranges from 0 to 40. Typical scores for patients with obsessive-compulsive disorder are in the 16 to 30 range, and a threshold of 16 is typically used for inclusion in drug trials. The semistructured interview and ratings can be completed in 15 minutes or less. A self-administered version has recently been developed and can be completed in 10 to 15 minutes. Computerized and telephone use also provide acceptable ratings. Prior to the first use of the YBOCS, an associated 64-item checklist is administered to provide a more detailed assessment of the specific content of the patient's obsessions and compulsions. Reliability studies of the YBOCS show good internal consistency, interrater reliability, and test-retest reliability over a 1-week interval. Validity appears good, although data are fairly limited in this developing field. The YBOCS has become the standard instrument for assessing obsessive-compulsive disorder severity and is used in virtually every drug trial. It may also be used clinically to monitor treatment response. Substance Use Disorders Substance use disorders include both abuse of and dependence on alcohol and drugs. 
These disorders, particularly those involving alcohol, are common and debilitating in the general population, so screening instruments are particularly helpful. Because these behaviors are socially undesirable, underreporting of symptoms is a significant problem. Validation against drug tests or other measures is of great value, particularly when working with patients who have known substance abuse. CAGE The CAGE was developed in the mid-1970s to serve as a very brief screen for significant alcohol problems in a variety of settings, which could then be followed up by clinical inquiry. CAGE is an acronym for the four questions that make up the instrument: (1) Have you ever felt you should cut down on your drinking?; (2) Have people annoyed you by criticizing your drinking?; (3) Have you ever felt bad or guilty about your drinking?; (4) Have you ever had a drink first thing in the morning to steady your nerves or to get rid of a hangover (eye-opener)? Each “yes” answer is scored as 1, and these are summed to generate a total score. Scores of 1 or more warrant follow-up, and scores of 2 or more strongly suggest significant alcohol problems. The instrument can be administered in a minute or less either orally or on paper. Reliability has not been formally assessed. Validity has been assessed against a clinical diagnosis of alcohol abuse or dependence, and these four questions perform surprisingly well. Using a threshold score of 1, the CAGE achieves excellent sensitivity and fair to good specificity. A threshold of 2 provides still greater specificity, but at the cost of a fall in sensitivity. The CAGE performs well as an extremely brief screening instrument for use in primary care or in psychiatric practice focused on problems unrelated to alcohol. However, it has


limited ability to pick up early indicators of problem drinking that might be the focus of preventive efforts. Alcohol Use Disorders Identification Test (AUDIT) The AUDIT was developed by the World Health Organization in the late 1980s as a brief screening instrument designed for the early detection of hazardous (i.e., involving the risk of harm) and harmful (i.e., involving the presence of harm) alcohol use in a variety of settings. It focuses on both the past year and current drinking. It includes a 10-item core screening instrument covering alcohol consumption, drinking behaviors, and alcohol-related problems. Each item is rated using item-specific anchors scored 0 to 4 and summed for a total score of 0 to 40. The AUDIT can be administered and scored in less than 5 minutes and does not require professional training. The AUDIT also offers a Clinical Screening Procedure involving a physical examination and blood tests, which adds no more than 5 to 10 minutes to a routine medical examination. Reliability of the AUDIT appears good, based on internal consistency data. Validity judged against a clinical diagnosis of alcoholism is also good; a threshold score of 8 is quite sensitive but somewhat nonspecific. A score of 10 is more specific but at a cost in sensitivity. Validity also appears good, based on correlation with other alcohol self-report measures and risk factors for alcoholism. The AUDIT provides an excellent brief screen for alcohol problems and is particularly good for detecting problem drinking at a fairly mild stage. However, its focus on early detection of hazardous and harmful drinking makes it less suited for use as a diagnostic instrument. Drug Abuse Screening Test (DAST) The DAST was developed in the early 1980s to serve as a screening and assessment instrument for drug abuse. The DAST is an adaptation of the Michigan Alcohol Screening Test (MAST), used to screen for alcoholism. 
It focuses on lifetime drug use, so it is not designed to measure changes over time. The current version of the DAST has 20 items, all of which are answered “yes” or “no,” and can be given orally or as a paper and pencil questionnaire. An earlier version had 28 items, so the 20-item version is sometimes called the Brief DAST. The positive items can be summed to form a 20-point scale. The DAST can be administered and scored in less than 10 minutes. Reliability is very good, based on internal consistency. Validity based on ability to detect drug abuse disorder also appears high, with excellent sensitivity and fairly good specificity using a threshold score of 5. The DAST is useful as a screening device for drug abuse problems in patients with other mental disorders, particularly alcohol abuse. It also provides an overview of problem severity that may be useful in guiding treatment choices. Addiction Severity Index (ASI) The ASI was developed in the early 1980s to serve as a quantitative measure of symptoms and functional impairment caused by alcohol or drug disorders. It covers demographics, alcohol use, drug use, psychiatric status, medical status, employment, legal status, and family and social issues. Frequency, duration, and severity are assessed. It has 142 items in varying formats including yes-no, multiple choice, and scaled items. They include both subjective and objective items reported by the patient and observations made by the interviewer. In each area, the ASI yields the rater's global assessment of severity along with a computer score on a 0 to 1 scale. The 142-item version includes information on the past 30 days and lifetime status, but a shorter version is available for use at follow-up. The instrument is designed for clinician administration but has been used successfully by trained lay raters. Training is


recommended, and both manuals and formal training programs are available. A computerized version is also available. The standard ASI requires 45 to 75 minutes to complete, but the follow-up version can be completed in 15 to 20 minutes. Very good to excellent reliability has been demonstrated for the overall composite score, with somewhat lower reliability for severity ratings in each area. Validity has also been demonstrated, based on correlation with other measures and discrimination of patient and nonpatient populations. Normative data are available for a range of populations of alcohol and drug abusers, including alcohol clinic patients, drug abusers, homeless persons, and prisoners. The principal use of the ASI is as an aid to treatment planning and the assessment of treatment outcome in clinical and law enforcement settings. It is relatively time consuming to administer but performs well for this purpose. It is also used in clinical research as a sensitive indicator of baseline severity and change over time, which allows comparison between clinical research and clinical practice. Eating Disorders Eating disorders include anorexia nervosa, bulimia nervosa, and binge-eating disorder, which is included in Appendix B of DSM-IV and is gaining acceptance among eating-disorders clinicians and researchers. A wide variety of instruments, particularly self-report scales, are available. Because of the secrecy that may surround dieting, bingeing, purging, and other symptoms, validation against other indicators (e.g., body weight for anorexia, dental examination for bulimia) may be very helpful. Such validation is particularly critical for patients with anorexia, who may lack insight into their difficulties. Eating Disorders Inventory (EDI) The EDI was developed in the early 1980s to provide a multidimensional self-report assessment of eating disorder symptomatology and related psychological attributes. 
The current version of the EDI, the EDI-2, has 91 items in 11 subscales: Drive for Thinness, Bulimia, Body Dissatisfaction, Ineffectiveness, Perfectionism, Interpersonal Distrust, Interoceptive Awareness, Maturity Fears, Asceticism, Impulse Regulation, and Social Insecurity. Each item is rated on a six-point frequency-based Likert scale. The two ratings at the symptomatic end of the scale are scored 2; the middle two, scored 1; and the last two, scored 0; then the items in each subscale are added to generate a subscale score. Instrumentwide total scores may also be obtained but are not considered meaningful. Norms for each subscale are available for a variety of eating-disordered and nonclinical populations. The EDI can be completed in less than 20 minutes. Easy-reading and childhood versions are available. A computerized version is also available. Reliability data indicate very good internal consistency and test-retest reliability for virtually all EDI subscales. Validity of the EDI subscales is supported by correlation with related eating disorder measures and discrimination between patient and nonpatient samples. The EDI subscales also correlate moderately with ratings of these domains by trained clinicians. The EDI has several uses that apply in both clinical and research practice. Its principal use is in providing a range of data that may help in treatment planning. For instance, body dissatisfaction is an important predictor of prognosis and treatment response in bulimia nervosa. Some of the subscales, particularly the Bulimia scale, have also been shown to be sensitive to change with treatment and thus may be used to monitor patients over time. The EDI has also been used for screening purposes in primary care or other settings to identify individuals at high risk for eating disorders. The Drive for Thinness, Bulimia, and Body Dissatisfaction scales are probably the most useful in this regard. For instance, scores of 14 or above on the Drive for


Thinness scale suggest an increased risk for anorexia nervosa and warrant further evaluation. Bulimia Test-Revised (BULIT-R) The BULIT-R was developed in the mid-1980s to provide a categorical and continuous assessment of bulimia nervosa. The current version, while designed for DSM-III-R criteria, has been validated for DSM-IV as well. The BULIT-R has 36 self-report items, each scored on an item-specific five-point Likert scale. Of these items, 8 provide descriptive information, and the remaining 28 are summed to provide the total score, which ranges from 28 to 140. Young women with bulimia nervosa typically score above 110, while young women without disordered eating typically score below 60. The instrument can be completed in about 10 minutes. The BULIT-R shows high reliability, based on multiple studies of internal consistency and test-retest reliability. Validity is supported by high correlations with other bulimia assessments. The recommended cutoff of 104 to identify probable cases of bulimia shows high sensitivity and specificity for a clinical diagnosis of bulimia nervosa. Using cutoffs between 98 and 104, the BULIT-R has been used successfully to screen for bulimia nervosa. As with any screening procedure, follow-up by clinical examination is indicated for individuals scoring positive; clinical follow-up is particularly critical because the BULIT-R does not distinguish clearly between different types of eating disorders. The BULIT-R may also be useful to track symptoms over time or in response to treatment, in both clinical and research practice, although more detailed measures of the frequency and severity of bingeing and purging may be preferable in research settings. Cognitive Disorders Dementia is becoming an increasing focus of psychiatric practice. A wide variety of measures are available. Most involve cognitive testing and provide objective, quantifiable data. 
However, scores vary by educational level in subjects without dementia, so these instruments tend to be most useful when the patient's baseline score is known. Other measures focus on functional status, which can be assessed on the basis of a comparison with a description of the subject's baseline function; these types of measures generally require a knowledgeable informant and may thus be more cumbersome to administer, but they tend to be less subject to educational biases. Mini-Mental State Examination The MMSE is a 30-point cognitive test developed in the mid-1970s to provide a bedside assessment of a broad array of cognitive functions including orientation, attention, memory, construction, and language. It can be administered in under 10 minutes by a busy doctor or a technician and scored rapidly by hand. The MMSE has been extensively studied and shows excellent reliability. Validity appears good, based on correlations with a wide variety of more comprehensive measures of mental functioning and clinicopathological correlations. One common use of the MMSE is in screening for dementia, in both office practice and epidemiological or clinical research. For this purpose, a cutoff of 24 for identifying cases of dementia has been suggested, but it is probably more accurate to use age- and education-adjusted norms to interpret the results. For patients with extensive education, who may score 30 out of 30 despite clear evidence of functional decline, a more difficult cognitive test, full neuropsychological battery, or clinical interview may be required to detect dementia. The other principal use of the MMSE is in following the progression of dementia over time. As a rule of thumb, mild dementia ranges from a score of 20 to 24, moderate from 11 to 19, and severe from 0 to 10. However, these figures do not take the educational


differences noted above into account. In addition, the MMSE performs less well in tracking progression of dementia in the lower ranges, as many patients become untestable. Blessed Information Memory Concentration Test (IMC) The IMC, sometimes called the Blessed IMC after its developer, was developed in the late 1960s for studies of the relationship between dementia severity and neuropathological changes. The original version of the scale, developed in Britain, had 29 items; the current American version has 26. Areas assessed include information (date, time, place, name, age, remote personal information, dates of the world wars, names of government leaders), memory (a name and address for 5-minute recall), and concentration (counting forward and backward from 1 to 20). A six-item version, sometimes called the short form of the Blessed or the Orientation Memory Concentration (OMC) test, is also available: it asks only the time of day, month, year, the 5-minute recall of the address, and counting backward from 20 and is highly predictive of total IMC score. Errors on individual items are weighted; total scores on the 26-item IMC range from 0 (no errors) to 33, and on the six-item version from 0 to 28. The IMC can be administered in person or over the phone by a trained clinician or lay rater. Based on internal consistency and test-retest studies in demented subjects over a 1- to 6-week interval, reliability of both the 26-item and 6-item IMC appears very good. Validity also appears good, based on clinicopathological correlations and correlations with other dementia severity measures. Studies of changes with time in patients with Alzheimer's disease show an average annual increase of 3 to 4 points on the 26-item version and 2.5 points on the 6-item version. The principal use of the IMC is assessing dementia severity over time, either through the natural course of the disease or in response to treatment interventions. 
The IMC is also sometimes used as a screening instrument in clinical practice or community research studies. A cutoff score of 10 has been recommended for the 6-item version, but no standard cutoff is recommended for the 26-item version. In any case, very limited population data exist, and norms are not available by age or education, which are very likely to affect test results. Because of its focus on memory items, the IMC may perform better as a severity or screening measure in patients with Alzheimer's disease than in those with other dementing illnesses. Global Deterioration Scale (GDS), Brief Cognitive Rating Scale (BCRS), and Functional Assessment and Staging Tool (FAST) The GDS, BCRS, and FAST are a group of measures designed to provide ordinal staging of cognitive and functional status in patients with dementia, particularly those with Alzheimer's disease. The three instruments use a consistent seven-point scale. The GDS is a simple rating scale that describes seven stages from normal aging to severe dementia: (1) normal, (2) subjective complaints only, (3) subtle deficits with little or no functional decline except in very demanding tasks (e.g., managerial tasks at work, or preparing an elaborate social event like a holiday meal), (4) definitive deficits that interfere with complex activities of daily living (ADLs) (e.g., balancing a checkbook), (5) deficits that interfere with independent living in the community, (6) deficits that interfere with basic ADLs (e.g., dressing, toileting), and (7) profound deficits leading to the need for continuous assistance. The BCRS focuses on cognitive issues and describes the same seven levels in five different domains referred to as axes: (I) concentration, (II) recent memory, (III) remote memory, (IV) orientation, and (V) self-care. The FAST focuses on functional status, again in seven stages, but it adds substages within stages 6 and 7. All three scales should be completed


by an individual with clinical experience (physician, psychologist, nurse, or trained technician) after a review of all available information from the patient, informants, and medical records. The FAST can generally be completed in as little as 10 to 15 minutes, but the GDS and especially the BCRS may require 30 to 45 minutes. Reliability of these scales is excellent, based on interrater and test-retest studies. Validity of all three measures is supported by correlations with other cognitive and functional status scales. GDS and BCRS stages have also been validated against neuropathological data. FAST stages correspond closely to typical progression in Alzheimer's disease. The GDS, BCRS, and FAST are useful in staging dementia, especially Alzheimer's disease, which is much more likely to follow the described ordinal stages closely. Such staging may be used to provide a concise description for patients referred to other clinicians or settings or to track changes over time or in response to treatment. The FAST is especially useful in staging severe dementia and has been used extensively to assess the need for services. The GDS and BCRS are both sensitive to change, with average declines of approximately 0.5 GDS or BCRS subscale points per year in Alzheimer's disease patients. Personality Disorders and Personality Traits Personality may be conceptualized categorically as personality disorders or dimensionally as personality traits, which may be viewed as normal or pathological. The focus here is on personality disorders, and the traits are generally viewed as their milder forms. DSM-IV defines 10 personality disorders in three clusters, and an additional two disorders (passive-aggressive and depressive personality) are proposed in Appendix B for further study. 
Patients tend not to fall neatly into DSM-IV personality categories; instead, most patients who meet criteria for one personality disorder also meet criteria for one or more others, particularly within the same cluster. This and other limitations in the validity of the constructs themselves make it difficult to achieve validity in personality measures. Personality measures include both interviews and self-report instruments. Self-report measures are appealing in that they require less time and may appear less threatening to the patient. However, they tend to overdiagnose personality disorders. Because many of the symptoms suggesting personality problems are socially undesirable, and because patients' insight tends to be limited, clinician-administered instruments, which allow for probing and patient observation, may provide more accurate data. Structured Clinical Interview for DSM-IV Axis-II Personality Disorders (SCID-II) The SCID-II is the counterpart of the SCID for making DSM diagnoses of personality disorders. The initial version was developed for DSM-III-R in the mid-1980s, and the current version makes diagnoses according to DSM-IV. The SCID-II is organized by disorder and includes all 10 DSM-IV personality disorders plus the two proposed in Appendix B. A 119-item self-report screening questionnaire is generally given first to eliminate sections not needing further exploration: each of the items corresponds to a specific criterion for a DSM-IV personality disorder. The SCID-II proper includes one or two yes-no items for each criterion, with each affirmative answer to be followed up by examples from the person's life. Based on these answers, each criterion is scored 1 for false, 2 for subthreshold, and 3 for present, allowing criteria scores to be summed for a dimensional measure of each disorder or combined following the DSM-IV diagnostic rules for a categorical approach. 
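The two scoring approaches just described can be illustrated with a minimal sketch. It assumes only the 1/2/3 criterion ratings given in the text; the function names, the sample ratings, and the threshold of 5 criteria are hypothetical illustrations, not part of the SCID-II materials, and the number of criteria required for a given disorder must be taken from DSM-IV itself.

```python
from typing import List

# Criterion ratings as described in the text:
# 1 = false, 2 = subthreshold, 3 = present.
VALID_RATINGS = {1, 2, 3}


def _check(ratings: List[int]) -> None:
    """Reject any rating outside the 1-3 range."""
    for r in ratings:
        if r not in VALID_RATINGS:
            raise ValueError(f"invalid criterion rating: {r}")


def dimensional_score(ratings: List[int]) -> int:
    """Dimensional approach: sum the criterion ratings for one disorder."""
    _check(ratings)
    return sum(ratings)


def meets_categorical_threshold(ratings: List[int], required_present: int) -> bool:
    """Categorical approach: count criteria rated 3 (present) and compare
    with the number required. `required_present` is supplied by the caller
    (an assumption of this sketch; it is not encoded here)."""
    _check(ratings)
    present = sum(1 for r in ratings if r == 3)
    return present >= required_present


# Hypothetical ratings for a disorder with nine criteria.
example = [3, 3, 2, 1, 3, 3, 1, 2, 3]
print(dimensional_score(example))               # sum of the nine ratings
print(meets_categorical_threshold(example, 5))  # five criteria rated "present"
```

Note that subthreshold ratings contribute to the dimensional sum but not to the categorical count, which is one reason the two approaches can diverge for the same patient.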
The screening questionnaire can be completed by the patient in about 20 minutes; the interview generally requires about an hour. The SCID-II must be administered by doctoral-level clinicians, and training in the SCID-II is also


required. A computerized administration and scoring program is available. Reliability is good for the presence or absence of any disorder but only fair for specific personality disorders; the reliability of dimensional assessment is somewhat better. Validity is somewhat harder to determine, as agreement with clinician assessment tends to be modest, but given its comprehensiveness and strict adherence to DSM-IV criteria, the SCID-II may actually be more valid. The SCID-II is most useful to provide a standardized, comprehensive assessment of personality disorders, whether in research, forensic, or clinical settings. Personality Disorder Questionnaire (PDQ) The PDQ was developed in the late 1980s as a self-report questionnaire designed to provide categorical and dimensional assessment of DSM-III-R personality disorders and was subsequently revised for DSM-IV. An alternate version includes the two disorders in Appendix B as well. Another alternate version allows for ratings within the last few weeks and is designed to serve as a change measure. The current PDQ, the PDQ-IV, includes 85 yes-no items, designed primarily to assess the diagnostic criteria for DSM-IV personality disorders. Within the 85 items are embedded two validity scales to identify underreporting, lying, or inattention. There is also a brief clinician-administered Clinical Significance Scale to address the impact of any personality disorder identified by the self-report PDQ. The PDQ can provide categorical diagnoses with a scaled score for each or an overall index of personality disturbance based on the sum of all the diagnostic criteria. Overall scores range from 0 to 79; patients with personality disorders generally score above 30, psychotherapy outpatients without such disorders tend to score in the 20-to-30 range, and normal controls tend to score below 20. The PDQ can be completed in under 30 minutes. Computerized administration and scoring are available. 
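As a rough illustration of how the total-score ranges above might be applied, the following sketch maps a PDQ total to the descriptive band reported in the text. The function name and band labels are hypothetical, the handling of the boundary values 20 and 30 is an assumption, and the bands describe group tendencies rather than diagnostic cutoffs.

```python
def pdq_total_band(total: int) -> str:
    """Map a PDQ total score (0 to 79) to the descriptive range in the text.

    Band labels are illustrative only; scores of exactly 20 or 30 are
    assigned to the middle band here, which is an assumption.
    """
    if not 0 <= total <= 79:
        raise ValueError("PDQ total scores range from 0 to 79")
    if total > 30:
        return "range typical of patients with personality disorders"
    if total >= 20:
        return "range typical of psychotherapy outpatients without personality disorders"
    return "range typical of normal controls"


for score in (12, 25, 41):
    print(score, "->", pdq_total_band(score))
```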
Reliability is fair to good for dimensional assessment and quite variable for categorical assessment, with good reliability for obsessive-compulsive and antisocial personality, and inadequate reliability for many disorders. Validity judged against semistructured clinician-administered interviews is also variable. The PDQ, like other self-report instruments, tends to overdiagnose personality disorders, with many false positives and few false negatives. Its brevity, excellent sensitivity, and poor specificity make it most useful as a screening device, with a follow-up semistructured interview for patients screening positive. Childhood Disorders A wide variety of instruments is available to assess mental disorders in children. Despite a rich array of instruments, the evaluation of children remains difficult for several reasons. First, the child psychiatric nosology is at an earlier stage of development, and construct validity is often problematic. Multiple changes in diagnostic criteria from DSM-III to DSM-III-R to DSM-IV complicate the choice of measures. Second, because children change markedly with age, it is virtually impossible to design a measure that covers children of all ages. Finally, because children, particularly young children, have limited ability to report their symptoms, other informants are necessary. This often creates problems because child, parent, and teacher reports of symptoms frequently disagree, and the optimal way to combine information is unclear. Child Behavior Checklist (CBCL) The CBCL is a family of self-rated instruments that survey a broad range of difficulties encountered in children from preschool age through adolescence. One version of the CBCL, designed for completion by parents of children aged 4 to 18, is shown in Table 7.8-13. Another version is available for parents of children ages 2 to 3. The Youth Self Report is completed by children ages 11 to 18, and


the Teacher Report Form is completed by teachers regarding school-age children. The scale includes not only problem behaviors, but academic and social strengths as well. Each version includes approximately 100 items scored on a 3-point Likert scale. Scoring can be done by hand or computer, and normative data are available for each of the three subscales: problem behaviors, academic functioning, and adaptive behaviors. A computerized version is also available. The CBCL does not generate diagnoses, but instead suggests cutoff scores for problems in the “clinical range.” Parent, teacher, and child versions each show high reliability on the problem subscale, but the three informants frequently do not agree with one another. The CBCL may be useful in clinical settings as an adjunct to clinical evaluation: it provides a good overall view of symptomatology and may also be used to track change over time. It is used frequently for similar purposes in research involving children, so research findings can be compared with clinical experience. The instrument does not, however, provide diagnostic information, and its length limits its efficiency for tracking purposes. Diagnostic Interview Schedule for Children (DISC) The DISC was originally developed in the early 1980s as a fully structured diagnostic interview for making DSM-III diagnoses in children. It has since been revised for DSM-III-R and DSM-IV. The current DISC, the DISC-IV, covers a broad range of DSM-IV diagnoses, both current and lifetime. It has nearly 3000 questions but is structured with a series of stem questions that serve as gateways to each diagnostic area, with the remainder of each section skipped if the subject answers “no.” Subjects who enter each section have very few skips, so both symptom scales and diagnostic information can be obtained. Child, parent, and teacher versions are available. 
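The gateway structure described above can be sketched in a few lines. This is a simplified illustration, not the DISC algorithm: it assumes a single stem question per diagnostic area, and the section names and question labels are hypothetical placeholders rather than actual DISC items.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Section:
    """One diagnostic area: a stem (gateway) question plus follow-up questions."""
    name: str
    stem: str
    follow_ups: List[str]


def run_interview(
    sections: List[Section], ask: Callable[[str], bool]
) -> Dict[str, Dict[str, bool]]:
    """Ask each stem question; ask the follow-ups only if the stem is answered yes."""
    responses: Dict[str, Dict[str, bool]] = {}
    for section in sections:
        answers = {section.stem: ask(section.stem)}
        if answers[section.stem]:
            # Stem endorsed: the whole section is asked, with no further skips.
            for question in section.follow_ups:
                answers[question] = ask(question)
        responses[section.name] = answers
    return responses


# Placeholder content and scripted answers, for illustration only.
sections = [
    Section("area_A", "stem_A", ["A1", "A2", "A3"]),
    Section("area_B", "stem_B", ["B1", "B2"]),
]
scripted = {"stem_A": True, "A1": True, "A2": False, "A3": True, "stem_B": False}
asked: List[str] = []


def ask(question: str) -> bool:
    asked.append(question)
    return scripted[question]


result = run_interview(sections, ask)
print(asked)   # questions actually asked; B1 and B2 are skipped
print(result)
```

Because follow-up questions are asked only after an endorsed stem, interview length depends on how many areas are entered, which is consistent with administration time varying with the number of symptoms endorsed.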
Computer programs are available to implement diagnostic criteria, to generate severity scales based on each version, or to combine parent and child information. A typical DISC interview may take more than an hour for a child, plus an additional hour for a parent, but because of the stem question structure, the actual time varies widely with the number of symptoms endorsed. The DISC was designed for lay interviewers. It is nonetheless fairly complicated to administer, and formal training programs are highly recommended. Reliability of the DISC is only fair to good and generally is better for the combined child and parent interview. Validity judged against a clinical interview by a child psychiatrist is also fair to good, better for some diagnoses and better for the combined interview. The DISC is well tolerated by parents and children and can be used to supplement a clinical interview to ensure comprehensive diagnostic coverage. Because of its inflexibility, some clinicians find it uncomfortable to use, and its length makes it less than optimal for use in clinical practice. However, it is used frequently in a variety of research settings. Children's Depression Inventory (CDI) The CDI is a 27-item self-report measure of mood symptoms in children aged 7 to 17. A 10-item screening version is also available. The instrument may be administered as a simple paper-and-pencil test using a special easily scored form or by computer. The CDI has a first-grade reading level, so even young children can generally complete it on their own. Good reliability has been demonstrated, and good validity is suggested by its ability to distinguish currently depressed children from those with partially remitted depression or with other psychiatric disorders, along with its correlation with other measures of childhood depression. The principal uses of the CDI are in screening for depression in psychiatric patients or epidemiological


surveys: it works fairly well for this purpose but tends to miss a substantial fraction of cases because children are not always good reporters. The other use is as a change measure in office practice or clinical trials. It appears to have adequate sensitivity to change for this purpose but has not been subjected to extensive evaluation. Conners Rating Scales The Conners Rating Scales are a family of instruments designed to measure a range of childhood and adolescent psychopathology, but are most commonly used in the assessment of attention-deficit/hyperactivity disorder. There are teacher, parent, and self-report (for adolescents) versions and both short (as few as 10 items) and long (as many as 80 items, with multiple subscales) forms. Extensive normative data drawn from an ethnically diverse population are available for each sex across a broad age range. Even the longer forms can be completed in 15 to 20 minutes, and scoring can be accomplished rapidly. Rater training is not required. Reliability data are excellent for the Conners Rating Scales. However, teacher and parent versions tend to show poor agreement. Validity data suggest that the Conners Rating Scales are excellent at discriminating between attention-deficit/hyperactivity disorder patients and normal controls. They have more difficulty separating attention-deficit/hyperactivity disorder from other disruptive behavioral disorders such as conduct disorder, but this may largely reflect the genuine clinical difficulty of separating these syndromes. Newer versions of the Conners have been developed that aim to improve these discriminations, but they have not yet been subjected to extensive testing. The principal uses of the Conners Rating Scales are in screening for attention-deficit/hyperactivity disorder in school or clinic populations and following changes in symptom severity over time; sensitivity to change in response to specific therapies has been demonstrated for most versions of the Conners.


