
THE FLORIDA STATE UNIVERSITY SCHOOL OF VISUAL ARTS AND DANCE

A SYSTEMATIC ANALYSIS OF ART THERAPY ASSESSMENT AND RATING INSTRUMENT LITERATURE

By DONNA J. BETTS

A Dissertation submitted to the Department of Art Education in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Degree Awarded Spring Semester, 2005

Copyright 2005 Donna J. Betts All Rights Reserved

The members of the Committee approve the dissertation of Donna J. Betts defended on April 11, 2005.

Marcia L. Rosal Professor Directing Dissertation

Susan Carol Losh Outside Committee Member

David E. Gussak Committee Member

Penelope Orr Committee Member

Approved:

Marcia L. Rosal, Chair, Art Education Department

Sally McRorie, Dean, School of Visual Arts and Dance

The Office of Graduate Studies has verified and approved the named committee members.


"One must not always think that feeling is everything. Art is nothing without form."
Gustave Flaubert


For my parents David and Wendy Betts


ACKNOWLEDGEMENTS

Henry Adams said, "A teacher affects eternity; he can never tell where his influence stops." Thank you, Dr. Marcia Rosal, for your mentorship and the expertise you provided as the chair of my doctoral committee. I am profoundly grateful. Deep appreciation is extended to my doctoral committee members, Dr. David Gussak, Dr. Susan Losh, and Dr. Penelope Orr, for their collective wisdom and guidance. Thank you to my statistician, Qiu Wang, for many hours spent analyzing data and educating me about meta-analysis procedures. Thanks also to mentor Dr. Betsy Becker. The assistance provided by art therapist Carolyn Brown as a coder of primary studies is very much appreciated. Gratitude is expressed to the colleagues, mentors, and friends who offered their support and guidance: Anne Mills, Barry Cohen, Dr. Linda Gantt, Dr. Sue Hacking, and Dr. Tom Anderson. To the friends and family members who provided me with continuous motivation through the course of doctoral study, especially Evelyn Moore, Annette Moll, and Jeffrey Gray: thank you. I am grateful for the research grant bestowed upon me by the Florida State University Office of Graduate Studies.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

1. INTRODUCTION
   Problem to be Investigated
   Purpose of the Study
   Justification of the Study
   Meta-Analysis Techniques
   Research Questions and Assumptions
   Brief Overview of Study
   Definition of Terms
   Conclusion

2. LITERATURE REVIEW
   Historical Foundations and Development of Art Therapy Assessment Instruments
   Criticisms of the Original Assessment Tools
   Development of Art Therapy Assessments and Rating Instruments
   Art Therapy Assessments
   Rating Instruments
   Why Art Therapy Assessments Are Important
   Issues With Art Therapy Assessments
   Practical Problems
   Philosophical and Theoretical Issues
   Avenues For Improving Art Therapy Assessments
   Computer Technology and the Advancement of Assessment Methods
   Analyzing The Research on Art Therapy Assessment Instruments
   Syntheses of Psychological Assessment Research
   Syntheses of Creative Arts Therapies Assessment Research
   Conclusion

3. METHODOLOGY
   Approach for Conducting the Systematic Analysis
   Problem Formulation
   Criteria for Selection of Art Therapy Assessment Studies
   The Literature Search Stage
   Locating Studies
   Extracting Data and Coding Study Characteristics
   The Research Synthesis Coding Sheet
   Categorizing Research Methods
   Testing of the Coding Sheet
   Data Evaluation
   Identifying Independent Comparisons
   Data Analysis and Interpretation

4. RESULTS
   Descriptive Results
   Citation Dates
   Citation Types
   Art Therapy Assessment Types
   Patient Group Categories
   Rater Tallies
   Inter-Rater Reliability
   Meta-Analysis of Inter-Rater Reliability
   Examination of Potential Mediators
   Concurrent Validity
   Meta-Analysis of Concurrent Validity
   Examination of Potential Mediators
   Conclusion

5. CONCLUSIONS
   Research Questions
   Question One
   Methodological Problems Identified in the Primary Studies
   Problems with Validity and Reliability of the Articles Researched
   Methods for Improvement of Tools
   Question Two
   Question Three
   Limitations of the Study
   Validity Issues in Study Retrieval
   Issues in Problem Formulation
   Judging Research Quality
   Issues in Coding
   Other Sources of Unreliability in Coding
   Validity Issues in Data Analysis
   Issues in Study Results
   Recommendations for Further Study
   Conclusion

APPENDICES
REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

1. Studies Reporting Percent Agreement (Hacking 1999)
2. Patient Group Categories
3. Number of Raters
4. Inter-Rater Reliability
5. Kappa Effect Size
6. Percentage Effect Size
7. Potential Mediating Variables (Inter-Rater Reliability)
8. Concurrent Validity Frequencies
9. Concurrent Validity Effect Sizes
10. Potential Mediating Variables (Concurrent Validity)
11. Author/Coder Favor Tally
12. Author/Coder Favor Frequencies
13. Sampling Flaws
14. Rating Systems Needing Revision

LIST OF FIGURES

1. Citation Date Histogram
2. Citation Type
3. Art Therapy Assessment/Rating Instrument Type
4. Author-Identified Study Weaknesses

ABSTRACT

Art-based assessment instruments are used by many art therapists to: determine a client's level of functioning; formulate treatment objectives; assess a client's strengths; gain a deeper understanding of a client's presenting problems; and evaluate client progress. To ensure the appropriate use of drawing tests, evaluation of instrument validity and reliability is imperative. Thirty-five published and unpublished quantitative studies related to art therapy assessments and rating instruments were systematically analyzed. The tools examined in the analysis are: A Favorite Kind of Day (AFKOD); the Bird's Nest Drawing (BND); the Bridge Drawing; the Diagnostic Drawing Series (DDS); the Child Diagnostic Drawing Series (CDDS); and the Person Picking an Apple from a Tree (PPAT). Rating instruments are also investigated, including the Descriptive Assessment of Psychiatric Art (DAPA), the DDS Rating Guide and Drawing Analysis Form (DAF), and the Formal Elements Art Therapy Scale (FEATS). Descriptive results and synthesis outcomes reveal that art therapists are still in a nascent stage of understanding assessments and rating instruments, that flaws in the art therapy assessment and rating instrument literature are numerous, and that much work has yet to be done. The null hypothesis, that homogeneity exists among the study variables identified in art therapy assessment and rating instrument literature, was rejected. Variability in the results of the concurrent validity and inter-rater reliability meta-analyses indicates that the field of art therapy has not yet produced sufficient research in the area of assessments and rating instruments to determine whether art therapy assessments can provide enough information about clients or measure the process of change that a client may experience in therapy.
Based on a review of the literature, it was determined that the most effective approach to assessment incorporates objective measures such as standardized assessment procedures (formalized assessment tools and rating manuals; portfolio evaluation; behavioral checklists), as well as subjective approaches such as the client's interpretation of his or her artwork. Due to the inconclusive results of the present study, it is recommended that researchers continue to explore both objective and subjective approaches to assessment.

CHAPTER 1

INTRODUCTION

A wide variety of tests are available for the purpose of evaluating individuals with cognitive, developmental, psychological, and/or behavioral disorders. Broadly defined, a test is:

a set of tasks designed to elicit or a scale to describe examinee behavior in a specified domain, or a system for collecting samples of an individual's work in a particular area. Coupled with the device is a scoring procedure that enables the examiner to quantify, evaluate, and interpret behavior or work samples. (American Educational Research Association [AERA], 1999, p. 25)

The Buros Institute of Mental Measurements, an authority on various assessment instruments and publisher of the Mental Measurements Yearbook and Tests in Print, publishes critical reviews of commercially available tests (Buros Institute, Test Reviews Online, n.d.[a]). The Institute provides an online assessment database that lists hundreds of tests in the following categories: achievement, behavior assessment, developmental, education, English and language, fine arts, foreign languages, intelligence and general aptitude, mathematics, miscellaneous, neuropsychological, personality, reading, science, sensory-motor, social studies, speech and hearing, and vocations. In the category of personality tests, roughly 770 instruments are listed.
These are defined as "Tests that measure individuals' ways of thinking, behaving, and functioning within family and society" and include: anxiety/depression scales, projective and apperception tests, needs inventories; tests assessing risk-taking behavior, general mental health, self-image/-concept/-esteem, empathy, suicidal ideation, emotional intelligence, depression/hopelessness, schizophrenia, abuse, coping skills/stress, grief, decision-making, racial attitudes, eating disorders, substance use/abuse (or propensity for abuse); general motivation, perceptions, attributions; parenting styles, marital issues/satisfaction, and adjustment (Buros Institute, Test Reviews Online, n.d.[b]).

These personality instruments are typically administered by psychologists, social workers, counselors, and other mental health professionals. Art therapists, master's- or PhD-level mental health practitioners, are also often expected to use assessment tools for client evaluation. Art therapists most often use instruments that are known in the field as art-based assessments or art therapy assessments. These two terms are used interchangeably throughout this manuscript, as are the words assessment, instrument, test, and tool. According to the American Art Therapy Association (2004a), assessment is the use of any combination of verbal, written, and art tasks chosen by the professional art therapist to assess the individual's level of functioning, problem areas, strengths, and treatment objectives. Art therapy assessments can be directed and/or non-directed, and can include drawings, paintings, and/or sculptures (Arrington, 1992). However, for the purposes of this dissertation, an art therapy assessment instrument is an objective, standardized measure that is designed by an art therapist (as opposed to a psychologist) and that incorporates drawing materials (as opposed to paint, clay, or other media). Referred to by some as projective techniques (Brooke, 1996), art therapy assessments are alluring with their ability to illustrate concrete markers of the inner psyche (Oster & Gould Crone, 2004, p. 1). The most practical art therapy assessments are easy to administer, take a reasonably brief amount of time to complete, are non-threatening for the client, and are easily interpreted (Anderson, 2001). An assessment is most useful when the art therapist has solid training in its administration (Hagood, 2002), and when, over time and with systematic study, he or she achieves mastery of the technique (Kinget, 1958). An art-based instrument is only as good as the system used to rate it.
A rating manual that accompanies an assessment should be illustrated and should link the scores to the examples and the instructions to the rater (Gantt, 2004). Dozens of art-based tools exist, and they are used with a variety of client populations in different ways (refer to Appendix A for a comprehensive list of art therapy assessments and rating instruments; Appendix B for a description of two of the most well-known art therapy tools; and Appendix C for a description of how an art therapy assessment is developed). In the American Art Therapy Association Education Standards document (AATA, 2004b), art therapy assessment is listed as a required content area for graduate art therapy programs. The most recent membership survey of the American Art Therapy Association (Elkins, Stovall, & Malchiodi, 2003) reported that 31% of respondents provide assessment, evaluation, and testing services as a part of their regular professional tasks. This figure represents nearly one-third of the art therapists who completed the survey. Although this proportion is not staggering, it nevertheless indicates that assessment is a fairly important task in art therapy practice. Problems with art therapy assessment tools are plentiful, ranging from validity and reliability issues to a lack of sufficient training in assessment use and limited understanding of the merits and drawbacks of different instruments. Because of the relatively common use of art-based tests, there is cause for concern.

Problem to be Investigated

Purpose of the Study

A major problem in the field of art therapy is that many art therapists use assessment instruments without knowing how applicable these tools are, and without understanding the implications of poor validity and lack of reliability. Therefore, the purpose of this study was to examine the existing literature related to art therapy assessments and rating instruments with the goal of presenting a systematic analysis of these methods.

Justification of the Study

The relatively common use of art therapy assessment instruments addresses the demand to demonstrate client progress, and can therefore help to improve the quality of treatment that a client receives (Deaver, 2002; Gantt & Tabone, 2001). Despite the various uses and apparent benefits of art therapy assessments, however, there are some problems with the ways in which these tools have been developed and validated. Much of the research is small in scale. Many studies with children have used matched groups, but have failed to control for age differences (Hagood, 2002).
Hacking (1999) found that the literature is poorly cumulated, i.e., that there is a lack of orderly development building directly on the older work (p. 166). Many studies conclude with unacceptable results and call for further research. Furthermore, some researchers do not publicly admit to flaws in their work (Hacking, 1999), continue to train others to use their assessments despite these flaws, and are reluctant to share the details of their research to facilitate improvement in the quality of assessments and rating scales.

Other problems are more theoretical or philosophical: for instance, it has been argued that art-based instruments are counter-therapeutic and even exploit clients. In addition, many practitioners believe that the use of art-based assessments is depersonalizing, as these tools typically fail to incorporate subjective elements such as the client's own verbal account of his or her artwork. Burt (1996) even criticized quantitative approaches to art therapy assessment research as being gender biased, asserting that the push for quantitative research excludes feminist approaches. Because of the numerous merits and drawbacks of art therapy assessment instruments, this topic is hotly debated among the most prominent art therapists. A recent issue of Art Therapy: Journal of the American Art Therapy Association is dedicated to this topic and includes: Linda Gantt's special feature, "The Case for Formal Art Therapy Assessments" (2004, pp. 18-29); a commentary by Scottish art therapist Maralynn Hagood (2004, p. 3); and the Journal's associate editor Harriet Wadeson's response to Gantt's feature in the same issue (2004, pp. 3-4). In her opening commentary of another recent issue of Art Therapy, the associate editor stated that the articles contained therein reflect aspects of the controversy that has surrounded art-based assessments in recent years (Wadeson, 2003, p. 63). At the foundation of this controversy are those who support the use of art therapy assessments and those who challenge it. In 2001, a panel of prominent art therapists argued their divergent viewpoints, which reflected the general disagreement about the applicability of art therapy assessments and the extent to which such tools help or hinder clients (Horovitz, Agell, Gantt, Jones, & Wadeson, 2001). Art therapy assessment continues to be a primary focus of debate in North America.

Since art therapy is more developed in America and in England than it is in other countries, it is also necessary to consider the status of assessment in England. Most British art therapists work for the National Health Service and do not need to be concerned about insurance reimbursement (M. Hagood, personal communication, May 30, 2004). The therapeutic process is their primary focus. British art therapists generally frown upon the use of artwork for diagnosis (Hagood, 1990), and are disturbed by the common use of art therapy assessments in the United States (Burleigh & Beutler, 1997). Furthermore, while assessment is a popular subject on this side of the Atlantic, there is a notable absence of published literature on the topic overseas. In the British art therapy journal, INSCAPE, it appears that since its first edition in 1968, only two articles pertaining to assessment have been published (Gulliver, circa 1970s; Case, 1998). In the British Journal of Projective Psychology, only three articles relating to the evaluation of patient artwork have been published, and these focus solely on graphic indicators: no formal art therapy assessment instruments are discussed (Uhlin, 1978; Luzzatto, 1987; Hagood, 1992).

In America, some attempts have been made to improve the status of assessment research via panel presentations at conferences (Cox, Agell, Cohen, & Gantt, 1998, 1999; Horovitz et al., 2001); an informal survey of assessment use in child art therapy (Mills & Goodwin, 1991); and a literature review of projective techniques for children (Neale & Rosal, 1993). However, to date no comprehensive review of art therapy assessment or rating instrument literature has been published. Hacking's (1999) small-scale weighted analysis of published studies on art therapy assessment research comes close in that she applied meta-analysis techniques to amalgamate published art-based assessment literature. Hacking's work thereby provides a foundation for a more systematic analysis of this research and serves as a rationale for the present study.

Meta-Analysis Techniques

The meta-analysis literature is an appropriate resource for information related to conducting a systematic analysis of art therapy assessment literature. Cooper (1998) designated meta-analysis (also known as research synthesis or research review) as the most frequently used literature review method in the social sciences. A meta-analysis serves to "replace those earlier papers that have been lost from sight behind the research front" (Price, 1965, p. 513), and to give future research a direction so as to maximize the amount of new information produced. The meta-analysis method entails the gathering of sound data about a particular topic in order to attempt a comprehensive integration of previous research using measurement procedures and statistical analysis methods (Cooper, 1998).
It requires surveying and analyzing large populations of studies, and involves a process akin to cluster sampling, treating independent studies as clusters that classify individuals according to the projects in which they participated (Cooper, 1998). A meta-analysis can serve as an important contribution to its field of focus. It can also generate consensus among scholars, focus debate in a constructive manner (Cooper, 1998), and give future research a direction so as to maximize the amount of new information produced (Price, 1965). The application of meta-analysis procedures to systematically analyze art therapy assessment research is conducive to: (1) amalgamating earlier assessment research so as to provide a comprehensive source of integrated information; (2) identifying problems in statistical methods used by primary researchers and providing remedies; (3) addressing problems with poor cumulation; and (4) providing information to improve the development of additional art therapy assessments (Cooper, 1998).

Hacking's (1999) research lends support to the application of meta-analysis techniques in synthesizing art therapy assessment literature. First, Hacking cited a lack of orderly development building directly on previous research (i.e., poor cumulation [Rosenthal, 1984]) as a rationale for conducting a synthesis of the research, despite the emphasis on qualitative studies in this area. Second, meta-analysis addresses the methodological difficulties identified with the literature review (Hacking, 1999, p. 167): (1) selective inclusion of studies, often based on the reviewer's impressionistic view of the quality of the study; (2) differential subjective weighting of studies in the interpretation of a set of findings; (3) misleading interpretations of study findings; (4) failure to examine characteristics of the studies as potential explanations for disparate or consistent results across studies; and (5) failure to examine mediating variables in the relationship.

Furthermore, a systematic analysis of art therapy assessment research is effective in amalgamating all earlier assessment research so as to provide a comprehensive source of integrated information, because, "Given the cumulative nature of science, trustworthy accounts of past research are a necessary condition for orderly knowledge building" (Cooper, 1998, p. 1). Meta-analysis techniques also enable the identification of problems in statistical methods used by previous assessment researchers and the provision of remedies, as was evidenced in Hacking's (1999) review.
This is important because "The value of any single study is derived as much from how it fits with previous work as from the study's intrinsic properties" (Cooper, 1998, p. 1). An in-depth study of art therapy assessments and rating instruments is warranted and timely. The purpose of assessment research is to discover the predictive ability of a variable or a set of variables to assess or diagnose a particular disorder or problem profile (Rosal, 1992, p. 59). Increasing this predictive ability is one purpose of conducting a systematic analysis of assessment research.

Practitioners have developed and used art therapy assessments, and there is every reason to believe that they will continue to do so. The present study applies meta-analysis techniques to integrate the research on art therapy assessments and rating instruments. Finally, the present synthesis assists in addressing some of the unanswered questions pertaining to this topic: these questions follow in the next section.

Research Questions and Assumptions

The overriding question of the present study is: what does the literature tell us about the current state of art therapy assessments? The study hypothesis is: there is no heterogeneity among the study variables identified in art therapy assessment and rating instrument literature. The research questions and assumptions are stated as follows:

(1) Question: To what extent are art therapy assessments and rating instruments valid and reliable for use with clients?
1a) Assumption: Methodological difficulties in previous art therapy assessment research exist.
1b) Assumption: It is possible to address the problems of validity and reliability, to improve upon the existing tools, and to develop better tools.

(2) Question: To what extent do art therapy assessments measure the process of change that a client may experience in therapy?
2a) Assumption: Art therapy assessments measure the process of change that a client may experience in therapy.

(3) Question: Do objective assessment methods such as standardized art therapy tools give us enough information about clients?
3a) Assumption: The most effective approach to assessment incorporates objective measures such as standardized assessment procedures (formalized assessment tools and rating manuals; portfolio evaluation; behavioral checklists), as well as subjective approaches such as the client's interpretation of his or her artwork.

Addressing these questions and assumptions in the present study contributes to the body of knowledge around this topic, and the results of the systematic analysis are a useful contribution to the field of art therapy.

Brief Overview of Study

In undertaking the tasks of art therapy assessment and rating instrument research data collection and analysis, Cooper's (1998) stages for conducting a meta-analysis provided a useful guideline: problem formulation; literature search; extracting data and coding study characteristics; data evaluation; and data analysis and interpretation. During the problem formulation stage, criteria for inclusion in the present study were identified. The literature search process entailed locating studies from three sources: informal channels, formal methods, and secondary channels. Once a satisfactory number of primary studies were found, it was determined what data should be extracted. A coding sheet was then designed. The researcher trained two individuals in coding procedures, and the data were then evaluated. This stage involved critical assessment of data quality: it was determined whether the data were contaminated by factors that are irrelevant to the central problem. For data analysis and interpretation, the researcher consulted with a statistician. This process involved some of the following steps: (a) simple description of study findings (Glass, McGaw, & Smith, 1981); (b) correlation of study characteristics and findings; (c) calculation of mean correlations, variability, and correcting for artifacts (Arthur, Bennett, & Huffcutt, 2001); (d) decision-making about whether to search for mediators; (e) selection of and testing for potential mediators; (f) linear analysis of variance models for estimation (Glass et al., 1981); (g) integration of studies that have quantitative independent variables; and (h) interpretation of results and making conclusions (Arthur et al., 2001). Finally, the results of the study are included in the present manuscript.

Definition of Terms

The following terms are referred to directly in this dissertation and/or they are relevant to the topic.
Adding Zs method: The most frequently applied method (of 16 possible methods) used for combining the results of inference tests so that an overall test of the null hypothesis can be obtained (Cooper, 1998: see pp. 120-122 for details). Not all findings are equally likely to be retrieved by the synthesist: significant results are more likely to be retrieved than nonsignificant ones. This implies that the Adding Zs method may produce a probability level that underestimates the chance of a Type I error (p. 123). One advantage of the Adding Zs method is that it allows the calculation of a Fail-safe N.
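The Adding Zs entry above can be illustrated with a minimal sketch. It assumes the common formulation in which each one-tailed p-value is converted to a standard normal deviate and the sum of the deviates is divided by the square root of the number of findings, and it assumes Rosenthal's one-tailed .05 criterion (1.645) for the Fail-safe N. The function names `adding_zs` and `fail_safe_n` are hypothetical and are not taken from the dissertation.

```python
import math
from statistics import NormalDist


def adding_zs(p_values):
    """Combine one-tailed p-values (each strictly between 0 and 1)
    with the Adding Zs method. Returns (zs, combined z, combined p)."""
    norm = NormalDist()
    # Convert each one-tailed p-value to its standard normal deviate.
    zs = [norm.inv_cdf(1.0 - p) for p in p_values]
    # The sum of the Zs is scaled by the square root of the number of findings.
    z_combined = sum(zs) / math.sqrt(len(zs))
    p_combined = 1.0 - norm.cdf(z_combined)
    return zs, z_combined, p_combined


def fail_safe_n(zs, z_crit=1.645):
    """Number of additional null findings (Z = 0) needed to raise the
    combined one-tailed p above .05. z_crit = 1.645 is an assumed criterion."""
    return (sum(zs) / z_crit) ** 2 - len(zs)


if __name__ == "__main__":
    zs, z_c, p_c = adding_zs([0.05, 0.05, 0.05, 0.05])
    print(round(z_c, 3), round(p_c, 5), round(fail_safe_n(zs), 1))
```

Because the method weights every finding equally, a synthesist who wanted to weight by sample size would need a different combination rule; that choice is outside what this sketch shows.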

Aggregate analysis: As opposed to estimating effect magnitude, aggregate analysis is an approach identified by Rosenthal (1991), whereby descriptive evidence across primary studies is integrated (Cooper, 1998). If the descriptive statistics can be put on a common metric so that they can be compared across studies, then they can be related to coded study characteristics (Hall, Tickle-Degnen, Rosenthal, & Mosteller, 1994).

Art-based assessment: The use of any combination of directed or non-directed verbal, written, and art tasks chosen by the professional art therapist to: determine a client's level of functioning; formulate treatment objectives; assess a client's strengths; gain a deeper understanding of a client's presenting problems; and evaluate client progress.

Chi-square (χ²): Statisticians refer to χ² as an enumeration statistic (Animated Software Company, n.d.[a]). Rather than measuring the value of each of a set of items, a calculated value of chi-square compares the frequencies of various kinds (or categories) of items in a random sample to the frequencies that are expected if the population frequencies are as hypothesized by the investigator. Chi-square is often used to assess the goodness of fit between an obtained set of frequencies in a random sample and what is expected under a given statistical hypothesis. For example, chi-square can be used to determine if there is reason to reject the statistical hypothesis that the frequencies in a random sample are as expected when the items are from a normal distribution.

Coding sheet: A sheet created by the meta-analyst to tally information collected from primary research reports (Cooper, 1998).

Cohen's d index: The measure of an effect size used when the means of two groups are being compared (Cooper, 1998). It is typically used with t tests or F tests based on a comparison of two conditions. The d index expresses the distance between two group means in terms of their common standard deviation.
Specifically, it is the difference between population means divided by the average population standard deviation (Rosenthal, 1994).

Combined significance levels: Combined exact probabilities that are associated with the results of each comparison or estimate of a relation (Cooper, 1998). Becker (1994; see also Rosenthal, 1984) described 16 methods for combining the results of inference tests to obtain an overall test of the null hypothesis. When the exact probabilities are used, the combined analysis results account for the different sample sizes and relationship strengths of each individual comparison.
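The d index defined above can be sketched for two samples as follows. The sketch assumes that the "common standard deviation" is estimated by the pooled sample standard deviation, which is one conventional choice and not necessarily the estimator used in the primary studies. The function name `cohens_d` is hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two samples.

    Assumption: the common standard deviation is estimated by pooling
    the two sample variances, weighted by their degrees of freedom.
    Each group needs at least two observations.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Illustrative scores only; not data from any study in the synthesis.
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

A positive value means the first group's mean is higher; the sign convention is arbitrary and should be fixed before values are compared across studies.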

Concurrent validity: The degree to which the scores on an instrument are related to the scores on another instrument administered at the same time (for the purpose of testing the same construct), or to some other criterion available at the same time (Fraenkel & Wallen, 2003, p. G-2).

Convergent validity: Refers to the notion that two separate instruments will concur, or converge, upon similar results. For example, the Stanford-Binet IQ test should have convergent validity with the Wechsler IQ Scales (ACPA, n.d.).

Criterion validity: A type of validity measurement that is used to evaluate the degree to which one measurement agrees with other approaches for measuring the same characteristic. There are three types of criterion validity: concurrent, predictive, and known groups (Google, n.d.).

Effect size (ES): The degree to which the phenomenon is present in the population, or the degree to which the null hypothesis is false (Cohen, 1988, pp. 9-10). ES is a name given to a family of indices that measure the magnitude of a treatment effect (Becker, 1998). These include the odds ratio, relative risk, and risk difference. Unlike significance tests, these indices are independent of sample size. ES measures are the common currency of weighted analysis studies that summarize the findings from a specific area of research. There is a wide array of formulas used to measure ES. In general, ES can be measured in two ways: (a) as the standardized difference between two means, or (b) as the correlation between the independent variable classification and the individual scores on the dependent variable. This correlation is called the effect size correlation (Rosnow & Rosenthal, 1996).

Fail-safe N (or Fail-safe sample size): Answers the question, "How many findings totaling to a null hypothesis confirmation (e.g., Zst = 0) would have to be added to the results of the retrieved findings in order to change the conclusion that a relation exists?" (Cooper, 1998, p. 123).
Rosenthal (1979) called this the tolerance for future null results. The Fail-safe N is a valuable descriptive statistic (Cooper, 1998). It permits evaluation of the cumulative result of a synthesis against assessment of the extent to which the synthesist has searched the literature. The Fail-safe N, however, also contains an assumption that restricts its validity. That is, its user must find credible the proposition that the sum of the unretrieved studies is equal to an exact null result (p. 124).

File-drawer problem: One aspect of the more general problem of publication bias (Becker, 1994, p. 228). The set of available studies does not represent the set of all studies ever conducted, and one reason for this is that researchers may have reports in their file drawers that were never published because their results were not statistically significant.

Fisher's Z: In meta-analysis, a statistical procedure used to transform a distribution of rs so as to make up for the tendency of rs to become skewed when determining effect size estimates. Zr is distributed nearly normally, and in virtually all meta-analytic procedures r should always be transformed to Zr (Rosenthal, 1994).

Fixed effects/fixed estimates effects (conditional) model: (vs. Random effects [unconditional] model) In this model, the universe to which generalizations are made consists of ensembles of studies identical to those in the study except for the particular people (or primary sampling units) that appear in the studies (Hedges, 1994, p. 30).

Hedges and Olkin adjustment for small sample sizes: A more precise procedure for adjusting sample size than merely multiplying each estimate by its sample size and then dividing the sum of these products by the sum of the sample sizes (Cooper, 1998). The adjustment has many advantages but also requires more complicated calculations (see Hedges and Olkin, 1985).

Homogeneity analysis: Compares the observed variance to that expected from sampling error, and is the approach used most often by meta-analysts (Cooper, 1998). It includes a calculation of how probable it is that the variance exhibited by the effect sizes would be observed if only sampling error was making them different (p. 145).

Inter-coder reliability: The reliability between codings indicated by two or more individuals who tally the contents of a primary research article on a coding sheet used in a meta-analytic study (see Orwin, 1994; and Stock, 1994).

Inter-rater reliability: The degree to which different raters or observers give consistent estimates of the same phenomenon (Yazdani, 2002a).
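As an illustration of how agreement between two coders might be quantified, the following is a minimal sketch of Cohen's kappa (see the Kappa entry in this list), which corrects the observed rate of agreement for the rate expected by chance. The function name and the coder data are hypothetical and are not drawn from the present study.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' categorical codes of the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must code the same non-empty set of items")
    n = len(coder_a)
    # Observed proportion of items on which the two coders agree.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's marginal proportions.
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical codes assigned by two coders to ten primary studies.
coder_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
coder_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "no"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 2))
```

In this made-up example the coders agree on 8 of 10 items, but half of that agreement would be expected by chance, so kappa is lower than the raw agreement rate.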
Inter-relationship between Reliability and Validity: For an instrument or test to be useful, both reliability and validity must be considered. A test with a high reliability may have low validity: i.e., results may be very consistent but inaccurate. A reliable but poorly validated instrument is totally useless. A valid but unreliable instrument can be of some value, thus it is often said that validity is more important than reliability for a measurement. However, to be truly useful, an instrument must be both reasonably valid and reasonably reliable.

Interaction effects: Interaction between two factors exists when the impact of one factor on the response depends on the setting of the other factor.
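To make the Fail-safe N defined earlier in this list more concrete, the sketch below uses the formulation commonly attributed to Rosenthal (1979): the squared sum of the studies' Z scores divided by the squared critical Z, minus the number of studies. The formula itself is not stated in this chapter, so it is an assumption here, as are the one-tailed .05 criterion (Z = 1.645), the function name, and the sample Z scores.

```python
def fail_safe_n(z_scores, z_criterion=1.645):
    """Number of unretrieved null-result studies needed to raise the
    combined probability above the criterion (assumed Rosenthal formulation)."""
    k = len(z_scores)
    if k == 0:
        raise ValueError("at least one study is required")
    # (sum of Z / critical Z)^2 - number of retrieved studies
    return (sum(z_scores) / z_criterion) ** 2 - k


# Hypothetical standard normal deviates from five retrieved studies.
study_z = [2.0, 1.5, 2.5, 1.0, 3.0]
print(round(fail_safe_n(study_z), 1))
```

The result is usually rounded up to a whole number of studies; a larger value suggests greater tolerance for unretrieved null findings, subject to the assumption noted in the definition that those findings sum to an exact null result.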

Kappa: The improvement over chance reached by coders. Cohen's kappa is a measure of reliability that adjusts for the chance rate of agreement (Cooper, 1998).

Maximum likelihood estimates: The model coefficient values that maximize the likelihood of the observed data (Tate, 1998, p. 342). (See also Pigott, 1994, pp. 170-171.)

Meta-analysis: The process of gathering sound data about a particular topic in order to produce a comprehensive integration of the previous research (Cooper, 1998).

Mediating variable: A variable that intervenes in a relationship between variables. Formerly known as moderating variable.

Odds ratio: The ratio of the odds of an event in one group divided by the odds in another group (Evidence Based Emergency Medicine, n.d.). When the event rate or absolute risk in the control group is small (less than 20% or so), then the odds ratio is very close to the relative risk.

Poor cumulation: Lack of orderly development building directly on previous research (Rosenthal, 1984).

Projective technique: A clinical technique qualifies as a projective device if it presents the subject with a stimulus, or series of stimuli, either so unstructured or so ambiguous that their meaning for the subject must come in part from within himself (Hammer, 1958, p. 169).

Publication bias: Bias that is induced by selective publication, in which the decision to publish is influenced by the results of the study (Begg, 1994, p. 399).

Q statistic: Used to test whether a set of d indexes is homogeneous. Hedges and Olkin (1985) identified the Q statistic, or Qt (Cooper, 1998). It has a chi-square distribution with N - 1 degrees of freedom, or one less than the number of comparisons (see Cooper, p. 146). Q-between is used to test whether the average effects from groupings are homogeneous (Cooper, 1998), and Q-within is used to compare groups of r indexes.

R index: The Pearson product-moment correlation coefficient (Cooper, 1998). It is the most appropriate effect size metric for expressing an effect size when one wishes to describe the relationship between two continuous variables.

Random effects (unconditional) model: (vs. Fixed effects/fixed estimates effects model) In this model, the study sample is presumed to be literally a sample from a hypothetical collection (or population) of studies. The universe to which generalizations are made consists of a population of studies from which the study sample is drawn (Hedges, 1994, p. 31).
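The entries for Fisher's Z, the r index, and the Q statistic can be tied together in a short sketch. It assumes the standard textbook approach in which each r is transformed to Zr, weighted by n - 3 (the inverse of the sampling variance of Zr), averaged, and then tested for homogeneity with a Q that is referred to a chi-square distribution with one less degree of freedom than the number of correlations. The function name and the sample correlations are hypothetical, and this is not presented as the exact procedure used in the present study.

```python
import math


def combine_correlations(rs, ns):
    """Weighted mean r (via Fisher's Z) and homogeneity Q for k correlations."""
    if len(rs) != len(ns) or not rs:
        raise ValueError("rs and ns must be non-empty and of equal length")
    if any(n <= 3 for n in ns):
        raise ValueError("each sample size must exceed 3")
    # Fisher's r-to-Z transformation: Zr = atanh(r).
    zs = [math.atanh(r) for r in rs]
    # Weight each Zr by n - 3, the inverse of its sampling variance.
    weights = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    # Q: weighted sum of squared deviations from the weighted mean Zr.
    q_total = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    degrees_of_freedom = len(rs) - 1
    # Transform the mean Zr back to the r metric for reporting.
    return math.tanh(z_bar), q_total, degrees_of_freedom


# Hypothetical correlations and sample sizes from three primary studies.
mean_r, q_total, df = combine_correlations([0.30, 0.45, 0.25], [40, 60, 50])
print(round(mean_r, 3), round(q_total, 3), df)
```

A Q value that exceeds the critical chi-square value for the given degrees of freedom would suggest that the correlations vary more than sampling error alone would predict.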

Reliability: Means repeatability or consistency (Yazdani, 2002a). A measure is considered reliable if it would give us the same result over and over again (assuming that what we are measuring isn't changing). Because the measurements are taken repetitively to determine reliability, measurement reliability is often referred to as test-retest reliability.

Rating instrument: A scoring procedure that enables the examiner to quantify, evaluate, and interpret behavior or work samples (AERA, 1999, p. 25).

T-test: The t test employs the statistic (t) to test a given statistical hypothesis about the mean of a population (or about the means of two populations) (Animated Software Company, n.d.[b]).

Test of equivalence of proportion: Indicates the homogeneity of effect size for each variable and their relation, and is inappropriate when the vast majority of non-significant results are not available (in which case the assumption of p=1 would create a false disparity between the significant and non-significant findings, imposing heterogeneity) (Hacking, 1999).

Type I error: Incorrectly rejecting the null hypothesis when it is true (Tate, 1998).

Validity: A property or characteristic of the dependent variable. An instrument (measuring tool) is described as being valid when it measures what it is supposed to measure (Yazdani, 2002b). A test cannot be considered universally valid. Validity is relative to the purpose of testing and the subjects tested, and therefore an instrument is valid only for specified purposes. It should also be made clear that validity is a matter of degree. Instruments or tests are described by how valid they are, not whether they are valid or not.

Vote-counting methods: The simplest methods for combining independent statistical tests (Cooper, 1998). Vote counts can focus only on the direction of the findings or they can take into account the statistical significance of findings. However, vote counting has been criticized on several grounds (Bushman, 1994). (See also Cooper, 1998, pp. 116-120; and Bushman, 1994, pp. 193-214.)

Weighted (wd) technique: Produces an unbiased estimate of effect size for the corrected group sizes (Wolf, 1986).

Z score: A special application of the transformation rules (Animated Software Company, n.d.[c]). The Z score for an item indicates how far and in what direction that item deviates from its distribution's mean, expressed in units of its distribution's standard deviation. The mathematics of the Z score transformation are such that if every item in a distribution is converted to its Z score, the transformed scores will necessarily have a mean of zero and a standard deviation of one. Z scores are sometimes called standard scores. The Z score transformation is especially useful when seeking to compare the relative standings of items from distributions with different means and/or different standard deviations. Z scores are especially informative when the distribution to which they refer is normal. In every normal distribution the distance between the mean and a given Z score cuts off a fixed proportion of the total area under the curve. Statisticians have created tables indicating the value of these proportions for each possible Z score.

Conclusion

This chapter presented information about the research problem to be investigated. Specifically, the purpose of the present study was discussed. The arguments for and against the use of art therapy assessments were explained, and a rationale for analyzing the research was stated, in an attempt to justify the study. The research questions, assumptions, and a brief overview were provided. The final section included the Definition of Terms. The next chapter, the literature review, further anchors the rationale for this body of work, A Systematic Analysis of Art Therapy Assessment and Rating Instrument Literature. The review aids in directing the present study and provides increased knowledge in this subject area, thereby reducing the chance of duplicating others' ideas (Fraenkel & Wallen, 2003).

CHAPTER 2

LITERATURE REVIEW

In this review of the literature, the foundations of art therapy assessment instruments are presented to provide information about historical milestones in psychological evaluation that influenced the development of art therapy tools. Criticisms of the original projective assessment tests are detailed, followed by a review of published materials in the field of art therapy that had a further impact on this area. The importance of art therapy assessments is described and contrasted with a discussion of practical problems and philosophical and theoretical issues. Avenues for improvement of art therapy assessments are delineated, including a section on computer technology and the advancement of art therapy assessment methods, and analyzing the research on art therapy assessment instruments. This is supported with references from the literature on syntheses of psychological assessment research and syntheses of creative arts therapies assessment research.

Historical Foundations and Development of Art Therapy Assessment Instruments

An overview of the foundations of art-based and projective assessment procedures illustrates their impact on the field of art therapy. Psychologists, psychiatrists, anthropologists, and educators have used artwork in evaluation, therapy, and research for over 100 years (MacGregor, 1989). From 1885 to 1920, educators collected and classified children's art (D. B. Harris, 1963). In 1887, Corrado Ricci, an art critic with interests in psychology, published the first known book of children's art, in which drawings were presented as potential psychodiagnostic tools (J. B. Harris, 1996). In 1931, Eng published an extensive bibliography of the literature on children's art. This review covered English, German, French, and Norwegian publications from 1892 to 1930, citing early developmental and descriptive studies.
Prior to 1900, a number of scientific articles describing the spontaneous artwork of mental patients were published in the United States and Europe (Lombroso, 1882; Simon, 1888; Tardieu, 1886; Hrdlička, 1899). These papers were mostly impressionistic. Several studies related to the psychodiagnostic potential of art followed in 1906 (Klepsch & Logie, 1982), including Fritz Mohr's work in establishing standardized procedures and methods using drawing tests with psychiatric patients (Gantt, 1992). Mohr, a German investigator, reviewed the 19th century literature and dismissed the earlier contributions as merely descriptive, especially those of the French (MacGregor, 1989, p. 189). Mohr's work served as a foundation for certain psychological projective instruments such as the House-Tree-Person Test (HTP) (Buck, 1948), and the Thematic Apperception Test (TAT) (Murray, 1943), as well as some evaluative procedures used by art therapists. The structural elements of an artwork were of particular concern to Mohr, because in his experience, they revealed information about the artist's thought processes. He found that the more fragmentary the picture, the more fragmentary the thought process.

Hermann Rorschach published his famous inkblot projective test in 1921, and it is widely used even today (Walsh & Betz, 2001). In 1929, another popular tool was developed: the Goodenough Draw-A-Man technique. Draw-A-Man was the first systematized art-based assessment method for estimating intelligence. Currently known as the Goodenough-Harris Draw-A-Man Test (D. B. Harris & Roberts, 1972), this instrument is the earliest example of a class of open-ended drawing tasks called human figure drawings (HFDs) and has since been incorporated into IQ tests such as the Stanford-Binet. Other work also led to the use of drawings in nonverbal intelligence tests, including Lowenfeld's (1947) research demonstrating that children pass through a sequence of orderly developmental stages in their drawings (Gantt, 1992).

Criticisms of the Original Assessment Tools

For 50 years, the research on projective drawings has yielded mixed results (Gantt & Tabone, 1998, p. 8).
During the 1970s and 1980s, the use of these tools declined due to decreased belief in psychoanalytic theory, greater emphasis on situational determinants of behavior, questions regarding the cost-effectiveness of these tools, and poor reviews about their validity (Groth-Marnat, 1990). Although projective tests are still popular among psychologists, several authors have questioned their scientific value, pointing to questionable research findings (Chapman & Chapman, 1967; Dawson, 1984; Kahill, 1984; Klopfer & Taulbee, 1976; Roback, 1968; Russell-Lacy, Robinson, Benson, & Cranage, 1979; Suinn & Oskamp, 1969; Swensen, 1968; Wadeson & Carpenter, 1976).

According to the Buros Institute Test Reviews Online (n.d.[a]), the Draw-A-Person (DAP) test was last reviewed in the Seventh Mental Measurements Yearbook in 1972. Roback (1968) examined 18 years (1949-1967) of findings on the DAP. Overall, the studies cited failed to support Machover's (1949) hypothesis that drawing a person is a natural vehicle for the expression of one's body needs and conflicts, and that the figure drawn is related to the individual artist with the same level of intimacy characterizing that individual's handwriting, gait, or any other of his or her own expressive actions. It was concluded that there is a great need for validated and standardized scales for the use of figure drawings in estimating personality adjustment.

Swensen's 1968 review of human figure drawing studies published since 1957 revealed that the quality of research in this area had improved considerably. The evidence suggested that the reliability of a specific aspect of a drawing is directly related to its validity: global ratings were found to be the most valid and reliable, whereas individual signs were found to be the least valid and reliable. It was also found that the presence of certain signs was related to the overall quality of the drawings, and as such, it was suggested that future research should control for the quality of the drawings. In contrast to Roback (1968), Swensen concluded that his findings provided support for the use of human figure drawings in assessment.

In their 15-year review of the personality test literature, Suinn and Oskamp (1969) found only a small amount of evidence relating to the validity of even the most popular tests, including the DAP and the HTP. They summarized their findings as follows:

Reviewing the results of studies of drawing tests, it appears that their validity is highly tenous [sic]. The major assumption of true body projection in drawings does not necessarily hold. In addition, artistic skill may be a distinct influence on drawings. Doubt has been cast on the hypothesis that the sex of the first drawn figure is a good index of sexual identification, and the scar-trauma hypothesis has had conflicting results. There is some evidence of the usefulness of the Draw-A-Person Test in screening adjusted from maladjusted individuals or organics from functional disorders, but the usefulness of the test in individual prediction is limited. There has been very little worthwhile research on the ability of the House-Tree-Person Test to predict diagnoses, personality traits, or specific outcomes. (pp. 129-130)

This study further substantiated the case against the use of these tools.

Klopfer and Taulbee queried, "Will this be the last time that a chapter on projective tests appears in the Annual Review of Psychology? Will the Rorschach be a blot on the history of clinical psychology?" (1976, p. 543). These witty writers reviewed more than 500 journal articles pertaining to projective techniques from 1971 through 1974, and determined that "if projective techniques are dead, some people don't seem to have gotten the message" (p. 543). To justify their hesitation in conducting a comprehensive literature review, Klopfer and Taulbee stated, "Even if one were to consider research on projective tests as the beating of a dead horse, a lot of people seem eager to get in on the flogging, so much so that the voluminous nature of the current literature makes an exhaustive review impossible" (pp. 543-544). As such, they explored problems of validation and stressed the three most widely used projective tests -- the TAT, the Rorschach, and Human Figure Drawings. These three projectives accounted for more than 70% of all the references identified. The most distinct contributions of tests were noted, especially related to the finding that personality and motivation did not fit the behavioral or self-concept categories. Klopfer and Taulbee concluded that psychologists would probably continue to develop, use, and rely upon projective instruments as long as they maintain an interest in the inner person and probing the depths of the psyche.

In 1976, Wadeson and Carpenter published a comparative study of art expression of patients with schizophrenia, unipolar depression, and bipolar manic-depression. The artworks of 104 adult inpatients with affective psychoses and 62 inpatients with acute schizophrenia were examined. Little support was provided for the hypotheses, and substantial within-diagnostic group variability and between-group overlap was seen.
However, some trends in the hypothesized directions were identified, but these disappeared when a subsample of age-matched patients was compared. Despite these findings, patient artworks and associations to the pictures were found to be valuable in understanding the patient, regardless of diagnosis.

Russell-Lacy et al. (1979) studied the validity of assessing art productions made by 30 subjects with acute schizophrenia as a differential diagnosis technique. The subjects' pictures were hypothesized to differ from pictures by other acute psychiatric patients and by subjects with no diagnosis. The only element found to be associated specifically with schizophrenia was repetition of abstract forms. Factors associated with psychiatric admission, regardless of diagnosis, included the presence of pictorial imbalance, overelaboration, childlike features, uncovered space, detail, and color variety. It was concluded that the use of art as a technique in differential psychiatric diagnosis is questionable.

In 1984, Dawson investigated differences between the drawings of depressed and nondepressed adults. A method for obtaining objective scores for content and structural variables was developed. Participants were patients of a Veterans Administration Medical Center who scored either on the high end or the low end of the Beck Depression Inventory. It was hypothesized that the drawings of depressed subjects would have less color, more empty space, smaller forms, more missing details, more shading, and fewer extra details than those of nondepressed subjects. It was also anticipated that specific contents would be found to be more prevalent in the drawings of subjects who reported suicidal ideation and depressed subjects. A linear combination of variables was expected to significantly differentiate the drawings of nondepressed and depressed subjects. The Depressed group left significantly more empty space in their drawings and included fewer extra details than the Nondepressed group. The difference between the group means was in the predicted direction but was not significant for the variables size, color, missing details, and suicide symbols. A discriminant function analysis of the variables did not discriminate between the drawings of the depressed and nondepressed subjects above a chance level. Some support for the hypotheses was found, which provided a rationale for continued research in this area. It was suggested that future research include the exploration of other measures of depression as criteria for identifying the groups used to analyze drawing variables, and the investigation of the structural variables Empty Space, Size, Color, Extra Details, and Missing Details in the drawings of other clinical groups.
Kahill (1984) examined the quantitative literature published between 1967 and 1982 on the validity and reliability of human figure drawing tests used as projectives with adults. Focusing on the assertions of Machover (1949) and Hammer (1958), Kahill discussed reliability estimates and evidence pertaining to the body-image hypothesis. Validity of structural and formal drawing variables (e.g., size, placement, perspective, and omission) and the content of figure drawings (e.g., face, mouth and teeth, anatomy indicators, and gender of first-drawn figures) was addressed, and the performance of global measures and the influence of confounding factors were described. It was concluded that establishing the meaning of figure drawings with any predictability or precision is difficult due to the inadequacies of figure drawing research.

The historical foundations of projective and drawing assessment techniques derived from the field of psychology established a precedent for the development of similar techniques in the field of art therapy. In the 1950s, when art therapy emerged simultaneously in the United States and in England, it was not long before art therapists identified a need for assessment methods that could provide the client with a wider variety of fine art materials than merely a pencil and a small piece of paper.

Development of Art Therapy Assessments and Rating Instruments

Art therapy assessments. Some of the earliest standardized art therapy instruments that have influenced the development of subsequent art therapy tools include the Ulman Personality Assessment Procedure (Ulman, 1965, 1975, 1992; Ulman & Levy, 1975, 1992), the Family Art Evaluation (Kwiatkowska, 1975, 1978), and Rawley Silver's tests (1983, 1988/1993, 1990, 1996, 2002).

The Ulman Personality Assessment Procedure (UPAP) had its beginnings in a psychiatric hospital. In 1959 the hospital's chief psychologist began sending patients to art therapist Elinor Ulman so that she could use art as a method of providing quick diagnostic information (Ulman, 1965, 1975, 1992). Ulman developed the first standardized drawing series: materials included four pieces of gray bogus paper and a set of 12 hard chalk pastels. The patient was asked to complete the series of four drawings in a single session, and each drawing had a specific directive. This diagnostic series was very influential in the development of other tools such as the Diagnostic Drawing Series (DDS) (Cohen, Hammer, & Singer, 1988). Ulman did not develop a standardized rating system for the UPAP, but she did make recommendations for future research (Ulman & Levy, 1975, 1992). She suggested that, instead of focusing on the content of pictures, attention to form and its correlation with personal characteristics might enhance reliability in the use of art-based assessment (p. 402).
Gantt and Tabone (1998) noted Ulman's recommendations and designed a rating system with formal elements.

During her tenure at the National Institute of Mental Health from 1958 until 1973, art therapist Hanna Yaxa Kwiatkowska (1975, 1978) developed a structured evaluation procedure for use with families. Kwiatkowska, influenced by Ulman's seminal work, had this to say about her contemporary: "Her (Ulman's) exquisite sensitivity and broad experience allowed her to provide important diagnostic conclusions drawn from four tasks given to the patients investigated individually" (1978, p. 86). Kwiatkowska's instrument, known as the Family Art Evaluation, consists of a single meeting of all available members of the nuclear family. The family is asked to produce the following drawings: 1) a free picture; 2) a picture of your family; 3) an abstract family portrait; 4) a picture started with the help of a scribble; 5) a joint family scribble; 6) a free scribble. Following completion of the drawings, the art therapist facilitates a discussion with the family about the artwork and the process. Kwiatkowska's evaluation is significant as one of the earliest standardized evaluation procedures developed by an art therapist.

Art therapist Rawley Silver became interested in assessment in the 1960s (Silver, 2003). Her doctoral dissertation, The Role of Art in the Conceptual Thinking, Adjustment, and Aptitudes of Deaf and Aphasic Children (1966), was influential in the development of Silver's assessments, The Silver Drawing Test of Cognition and Emotion (SDT) (1983, 1990, 1996, 2002), and the Draw A Story (DAS) (1988, 1993, 2002). The SDT includes three tasks: predictive drawing, drawing from imagination, and drawing from observation. The DAS is a semistructured interview technique using stimulus drawings to elicit response drawings, and has had a considerable impact on the field of art therapy.

While Ulman's, Kwiatkowska's, and Silver's work was valuable in influencing the development of additional art therapy instruments, more contemporary researchers have made contributions to the development of systems to rate art-based assessment tools.

Rating instruments. To reiterate the AERA definition, a rating instrument is a scoring procedure that enables the examiner to quantify, evaluate, and interpret behavior or work samples (1999, p. 25). Most art therapy assessment rating instruments are comprised of scales used to determine the extent to which an element is present in a drawing (such as amount of space used in the picture).
A rating scale presents a statement or item with a corresponding scale of categories, and respondents are asked to make judgments that most clearly approximate their perceptions (Wiersma, 2000, p. 311). Rating instruments vary in the types of scales that they use to measure test items. Generally there are four types of scales, each of which has a different degree of refinement in measuring test variables: nominal, ordinal, interval, and ratio (Aiken, 1997). The question is which type of scale is the best to use in a rating instrument. In addition to using references from the literature to address this question, the author surveyed members of the American Psychological Association Division 5, Measurement, Evaluation and Statistics, via their listserv.

Nominal and ordinal measures are convenient in describing individuals or groups (Aiken, 1997). With an ordinal scale that forces the rater to choose either good or bad, present or not present, for example, consistent responses, resulting in higher inter-rater reliability, are more likely (S. Rock, personal communication, February 21, 2005). However, comparing the numbers in terms of direction or magnitude is illogical (Aiken, 1997). Furthermore, "many things are not simply yes or no in the real world, but gradations -- a yes/no answer loses a lot of information and forces people to set a criterion for judgment. How true does it have to be before I say yes?" (N. Turner, personal communication, February 21, 2005).

Used in conjunction with the Diagnostic Drawing Series (DDS) (Cohen, Hammer, & Singer, 1988), the DDS rating system (Cohen, 1986/1994) (Appendices D and E) is comprised of 23 scales. Many DDS scales are ordinal and force a choice between two items, such as yes or no (presence or absence of a given item), yet the criteria that are being rated are not trivial (not superficial and readily ratable) (A. Mills, personal communication, February 21, 2005). The Descriptive Assessment of Psychiatric Artwork (DAPA) (Hacking, 1999), an instrument used for rating spontaneous artwork (i.e., any type of drawing or painting), is comprised of five scales: Color, Intensity, Line, Area, and Emotional Tone (Appendix F). Like the DDS, the DAPA has ordinal scales.

The most precise level of measurement is the ratio scale, which includes a zero value to indicate a total absence of the variable being measured (Aiken, 1997). This true zero, coupled with the equal intervals between numerical values on a ratio scale, enables measurements to be explained in a meaningful way.
However, the more choices the rater has, such as with an interval or ratio-type scale, the more likely the scores will vary from rating to rating (S. Rock, personal communication, February 21, 2005). An advantage of interval/ratio (sometimes referred to as Likert-type or Graphic) scales is that many variables are gradations rather than just yes or no (N. Turner, personal communication, February 21, 2005). The trade-off is that it takes people longer to read Likert or interval/ratio scales than nominal/binary or ordinal checklists.

Intervals in measurement scales are established on the basis of convention and usefulness. The basic concern is whether the level of measurement is meaningful and that the implied information is contained in the numerals. The meaning depends on the conditions and variables of the specific study. (Wiersma, 2000, p. 297)

Graphic/interval rating scales are advantageous because they are simple and easy to use, and they suggest a continuum and equal intervals (Kerlinger, 1986). In addition, these scales can be structured with variations such as continuous lines, vertical segmented lines, and lines broken into marked equal intervals. Six or seven choices (such as a scale ranging from very strongly disagree to very strongly agree, with more moderate items in between) will increase reliability (Hadley & Mitchell, 1995). Likert scales (regardless of the number of rating points) are assumed by many to be essentially interval, reflecting an underlying interval scale of measurement (S. Rock, personal communication, February 21, 2005). Others argue that, while the underlying dimension is interval (an even ratio), the scale is at best an ordinal scale. To overcome this ambiguity, many professionals treat Likert scales as interval scales and move on with their work.

The FEATS (Gantt & Tabone, 1998) (Appendices G and H), developed for rating the Person Picking an Apple From a Tree (PPAT) assessment drawings (Gantt, 1990), is an example of an equal-appearing Likert/interval scale (L. Gantt, personal communication, January 31, 2005). The intervals between the numbers on the FEATS scales cannot be assumed to be exactly equivalent all along the scale; for example, a four cannot be assumed to be exactly twice as much as a two. A strength of the FEATS is that it provides sample drawings that guide the rater, thereby increasing the FEATS' reliability, but at a cost:

Such sample descriptions lengthen the rater's task, particularly when it includes many ratings. A graphic rating scale is often (but not always) sufficiently clear if only the end points are given sample behavior descriptions and the idea of equal intervals is allowed to carry this burden regarding the intermediate scale points. (Hadley & Mitchell, 1995, p. 329)

Nominal and ordinal measures should not be compared in terms of direction or magnitude, but they are more likely to produce consistent responses, whereas interval/ratio scales can be compared in terms of direction or magnitude, but the scores will be more variable. So which type of scale is better? The choice should be specific to the instrument's purpose:

There is no conclusive evidence for using Likert-type versus binary-choice items in rating instruments. The format that best represents the underlying construct you are trying to measure should guide the selection of format. One must define the purpose of the scale, weigh the pros and cons of various format options -- including score interpretation -- and make the best choice while being aware of the limitations of score interpretation due to format. Both methods -- and points of view -- have value as long as we realize their limitations as well. (B. Biskin, personal communication, February 22, 2005)

Content checklists are sometimes included in an art-based rating instrument, separate from the scales. As the term checklist suggests, these are typically comprised of nominal items that would be checked on the list as either present or not present in a drawing. For example, in addition to the 23 DDS scales, there is a content checklist. The FEATS also has a content tally sheet. The rater is asked to place a checkmark for all items they see in the picture they are rating, such as whether the orientation of the picture is horizontal or vertical.

The preceding discussion sheds light on the importance of thorough rating instrument design. An assessment is only as good as the system used to rate it, because well-constructed, standardized scales for rating artwork are vital in order to validate assessment findings and to determine the reliability of subjects' scores. Assessment is an imperfect science, and as has been presented in this literature review, the variety and quality of the research is diverse. An in-depth discussion of this hotly debated topic is included later in the chapter. First, the broad implications of working with assessment tools should be considered, beginning with the reasons for their importance.

Why Art Therapy Assessments Are Important

It is the consensus of most mental health professionals, agency administrators, and insurance companies that "regardless of the formality or structure, assessment -- and reassessment at appropriate times -- constitutes the core of good practice" (Gantt, 2004, p. 18).
Furthermore, funding for research, treatment, or programming is only provided to those who demonstrate the efficacy of treatments or interventions. Standardized assessments are fundamental to all disciplines that deal with intervention and change, including the field of art therapy. Assessments are used in different settings to plan intervention or treatment and to evaluate results (Gantt, 2004). Some examples of this include: the federal government's mandate on the use of assessments in certain facilities; the Joint Commission on Accreditation of Healthcare Organizations' (JCAHO) regular inspection of evaluation and assessment procedures in selected institutions; public schools' use of standardized tests for the Individualized Education Plan (IEP); and the National Council on Aging's delineation of standards for assessment.

It is thought that the ongoing use and development of art therapy assessment tools is important to the advancement of the field. It has been suggested that exploration of original ways to evaluate clients in art therapy be encouraged, as creative investigation can be fruitful (Betts, 2003, p. 77). Many art therapy practitioners believe that assessments provide increased understanding of a client's developmental level, emotional status, and psychological framework. Such tools are also used for formulating treatment goals and gaining a deeper understanding of the client's presenting problems.

Clinicians are under pressure to demonstrate client progress in therapy. In art therapy, for instance, an assessment can be administered at the outset of treatment, during the middle phase of treatment, and again upon termination of services, and the artwork can be compared to determine the course of client progress. When practitioners and institutions are accountable for charting and reporting client progress, treatment standards are raised, and this has a trickle-down effect that tends to improve the quality of treatment a client receives (Deaver, 2002; Gantt & Tabone, 2001).

Deaver (2002) asserted that research might be beneficial in understanding the efficacy of various art therapy assessments, techniques, and approaches used with clients. She presented basic descriptions and examples of qualitative and quantitative approaches to art therapy research, and put forth ideas to bridge the gap between research and practice, within the context of providing improved services for clients. In 2001, Gantt and Tabone presented data on the PPAT and FEATS to demonstrate how these instruments assisted them in making clinical decisions and identifying predictor variables.
They found that PPAT drawings served as an effective aid in predicting how patients would respond to specific treatments, and this resulted in shorter treatment time. Shorter treatment helped the hospital to increase its efficiency, and to provide patients with improved quality of treatment.

Advocates of assessment assert that the various instruments help to provide meaningful information about clients (Rubin, 1999). Art therapy assessments may be beneficial in terms of observing patterns, generating comparative data, and addressing issues of reliability and validity (Malchiodi, 1994). Assessments are used to gain insight into a client's mood and psychological state, and to unveil diagnostic information. A client's spontaneous artwork can also be used to make diagnostic impressions. "Art can tell us much not only about what clients feel but also about how they see life and the world, their unique flow of one feeling into another, and the deep structure that underlies this flow of feeling" (Julliard & Van Den Heuvel, 1999, p. 113). A client's artwork enables the therapist to perceive the client's socio-cultural reality: the client's feelings about himself or herself, his or her family, environment, and culture.

Although there are many benefits to justify the use of art therapy assessment techniques, there are also several problems. Most clinicians have mixed opinions about the applicability of assessments. These are elucidated in the next section.

Issues With Art Therapy Assessments

Practical Problems

Some of the problems with art therapy assessment instruments are concrete and relate to lack of scientific rigor. Many tools are generally deficient in data supporting their validity and reliability, and are not supported by credible psychological theory (McNiff, 1998). Those who choose to assess clients through art have neglected to convincingly address the essence of empirical scientific inquiry: findings that link character traits with artistic expressions; replicable results based upon copious and random data; and uniform outcome measures which justify diagnosis of a client via his or her artwork.

Gantt and Tabone (1998) identified two problematic methods that were used in the formative years of assessment. Psychologists employed a testing approach, and looked for nomothetic (group) principles, stressing validity and reliability. Their principles were based on personality characteristics. The disadvantage to this method was that it used a sign-based procedure that took material out of context.
Conversely, psychoanalysts and art therapists perceived art as a reflection of mood or progress and attempted to understand the individual more thoroughly. This approach was faulty in that it lacked scientific rigor.

Interpretation of pictorial imagery is highly subjective (McNiff, 1998). It is a challenge to maintain objectivity and thereby establish validity in assessing art because artistic tensions within the total form are images of intuitively felt activity (Julliard & Van Den Heuvel, 1999, p. 114); and because art expresses a state of constant change and growth. Furthermore, unless the crudest diagnosis, patient or normal, can be made with sufficient precision, the assumption that paintings and drawings contain data related in regular ways to psychopathological categories lies open to serious question (Ulman & Levy, 1992, p. 107). For example, how can a patient's drawing accurately reveal whether he or she has schizophrenia?

As Golomb (1992) maintained in her critical review of projective drawing tests relating to the human figure, far-reaching conclusions and the inconsistent results of numerous replication studies indicate the dubious state of research in this area. For example, claims that the human figure drawn by an individual relates intimately to the impulses, anxieties, conflicts and compensations characteristic of that individual remain difficult to demonstrate due to problems of measurement validity (Swensen, 1968). Many studies with children have used matched groups, but have failed to control for age differences (Hagood, 2002). The literature is poorly cumulated, i.e., there is a lack of orderly development building directly on the older work (Hacking, 1999, p. 166). Many studies conclude with unacceptable results and call for further research: such studies reflect the casualness with which many art therapists approach the art therapy assessment process (Gantt, 2004; Mills & Goodwin, 1991; Phillips, 1994).

A major flaw in much of the art therapy assessment research relates to poor methods of rating pictures, and inappropriate use of statistical procedures for measuring inter-rater reliability. Hacking (1999) cited several studies that used unsuitable methods to determine inter-rater reliability. Nine such studies (Bergland & Moore Gonzalez, 1993; Cohen & Phelps, 1985; Gantt, 1990; Kaplan, 1991; Kirk & Kertesz, 1989; Langevin, Raine, Day, & Waxer, 1975a; Langevin, Raine, Day, & Waxer, 1975b; McGlashan, Wadeson, Carpenter, & Levy, 1977; Wright & Macintyre, 1982) reported a high value of r, which was interpreted as an indication of good agreement.
However, the correlation was stated to be inappropriate in this context, because the correlation coefficient is a measure of the strength of linear association between two variables, not agreement (p. 159). Furthermore, assessing agreement by a statistical method that is highly sensitive to the choice of the sample of subjects is unwarranted. Hacking cited Kay's (1978) famous study for incorrectly judging agreement by using a chi-square test, which is also a test of association. A study by Wadlington and McWhinnie (1973) was criticized by Hacking as using the comparison of means by a paired t-test, which is a hypothesis test. Similarly, Hacking found that the Russell-Lacy et al. (1979) study used 60 judges in groups of 10 to rate 5 pictures and compared the variation between scores of 0-10 agreements between groups, using the category ranking test Friedman's ANOVA; however, this test is also inappropriate for determining inter-rater reliability, as it is yet another test of association.

Methods cannot be deduced to agree well because they are not significantly different. A high scatter of differences may well lead to a crucial difference in means (bias) being non-significant. Using this approach, worse agreement decreases the chance of finding a significant difference and so increases the chance that the methods will appear to agree. Despite the authors' claims of good statistical agreement in study 69 (Wadlington & McWhinnie, 1973), most of the discussion reported their difficulties with the measure seriously affected their study results and recommended a shorter form for better reliability. (Hacking, 1999, p. 160)

Hacking suggested, "the simplest approach is to see how many exact agreements exist" (p. 160). She cited 7 studies that reported percentage agreement by tables of elements or overall agreement (Table 1).

Table 1. Studies Reporting Percent Agreement (Hacking, 1999)

Study | Reported agreement
Mills, Cohen, & Meneses (1993a, 1993b) | 95.7% agreement for 2 raters decreased to 77% for 29 raters.
Silver & Ellison (1995a, 1995b) | 94.3% agreement for 2 raters decreased to 61% for 10 raters.
Cohen & Phelps (1985) | Good agreement for 2 raters decreased to poor agreement for 4 raters.
Miljkovitch & Irvine (1982) | Reported 0.96 which, it is assumed, represents percentage agreement as there is no other information. Percentage agreement figures appeared to be reasonably high, but these could be unreliable when more raters are added.
Sims, Bolton, & Dana (1983) | Reported 0.97 which, it is assumed, represents percentage agreement as there is no other information. Percentage agreement figures appeared to be reasonably high, but these could be unreliable when more raters are added.
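Hacking's objection to reading a high r as good agreement, and her preference for counting exact agreements, can be illustrated with a small sketch. The ratings below are hypothetical (they come from none of the cited studies), and the helper names pearson_r and percent_agreement are illustrative only. The second rater scores every drawing exactly two points higher than the first, so the two sets of ratings are perfectly correlated even though the raters never give the same score.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation: strength of linear association between two lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def percent_agreement(x, y):
    """Proportion of items on which the two raters gave exactly the same score."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical scores for eight drawings (illustrative values only).
rater_a = [0, 1, 2, 3, 1, 2, 0, 3]
rater_b = [score + 2 for score in rater_a]  # rater B is always 2 points higher

r = pearson_r(rater_a, rater_b)
agree = percent_agreement(rater_a, rater_b)

print(f"r = {r:.2f}")                      # r = 1.00
print(f"exact agreement = {agree:.0%}")    # exact agreement = 0%
```

A constant bias between raters leaves the correlation untouched, which is why a high r on its own says little about whether raters would assign the same score to the same drawing.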

Hacking's analysis of the DDS (Cohen, Hammer, & Singer, 1988) further illustrates the problems with art-based assessments:

The DDS [Cohen, Hammer, & Singer, 1988] (is) one of few tests which attempt to validate, reliably rate their instrument and encourage replications. Described as a standardised [sic] evaluation supported by extensive research [Cohen, Mills, & Kijak, 1994], only 3 interrater studies have been included in this analysis: study 48 [Mills, Cohen, & Meneses, 1993a] reports agreement scores from 77-100% over 23 categories, giving 95.7% overall after 2 months training of the 2 main authors rating 30 sets of drawings by undescribed subjects. Study 49 [Mills, Cohen, & Meneses, 1993b] reports only 77% agreement between 29 naive raters performing the same measurements. Study 52 [Rankin, 1994] reports 96% agreement between raters of 4 details in tree drawings, by 30 patients with post traumatic dissociative disorder and 30 controls, taken from the DDS rating guide and protocol. Other studies used peculiar methods and were not included in this analysis. (1999, p. 61)

Two additional weaknesses related to inter-rater reliability were found in the calculation of agreement (Hacking, 1999): (1) a lack of accounting for where the agreement is located in the table, and (2) the fact that some agreement between raters is expected by chance. Hacking suggested that it would be more reasonable to consider agreement in excess of the amount by chance (p. 162), and found that Langevin and Hutchins' (1973) study was the only one which met this criterion. Hacking concluded that the best approach to this type of problem is that adopted by Knapp (1994) and McGlashan et al. (1977): the Kappa statistic. Kappa may be interpreted as the chance-corrected proportional agreement, but Hacking emphasized that it is important to show the raw data, which the Knapp and McGlashan studies failed to do.
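The idea of agreement in excess of chance can be sketched with the standard two-rater form of Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of exact agreements and p_e is the proportion expected by chance from each rater's marginal frequencies. The data are hypothetical checklist ratings (a single item marked present or absent in 20 drawings), and the function names are illustrative; this is not a reconstruction of any analysis reported by Hacking, Knapp, or McGlashan et al.

```python
from collections import Counter


def observed_agreement(r1, r2):
    """Proportion of items on which both raters chose the same category."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Two-rater Cohen's Kappa: agreement corrected for chance.

    Undefined (division by zero) when chance agreement p_e equals 1,
    i.e. when both raters use a single, identical category throughout.
    """
    n = len(r1)
    p_o = observed_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of the raters' marginal proportions per category.
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of one checklist item across 20 drawings.
rater_a = ["present"] * 16 + ["absent"] * 4
rater_b = ["present"] * 14 + ["absent"] * 2 + ["present"] * 2 + ["absent"] * 2

p_o = observed_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)

print(f"percent agreement = {p_o:.0%}")   # percent agreement = 80%
print(f"kappa = {kappa:.3f}")             # kappa = 0.375
```

Because both hypothetical raters mark "present" most of the time, much of their 80% raw agreement would be expected by chance alone, and Kappa is correspondingly modest. Printing the raw agreement next to Kappa is in the spirit of Hacking's point that the underlying data should be shown as well.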
In support of this statement, Hacking cited Neale's (1994) application of the DDS to children as having a much lower level of reliability than that reported by Mills et al. (1993a, 1993b): only 12 variables reached significance using the Kappa measure of agreement between 2 raters (Hacking, 1999, p. 162). The multitude of problems with art therapy assessment instruments and supporting research, particularly related to inter-rater reliability, is evident.

Philosophical and Theoretical Issues

Some art therapists are opposed to the use of art therapy assessments, and have suggested that efforts to link formal elements in artwork with psychiatric diagnosis be abandoned (Gantt & Tabone, 1998). These individuals fear that rigid, reductionistic classification robs artwork of its uniqueness and meaningfulness and suggest that there are other ways to look at art. Wadeson contended that art therapists' reliance on assessment instruments reflects a longing for magic (2002, p. 170). Suspicious of attempts to interpret artwork, Maclagan stated, "if there is an art in this analytic work, then it is all too often a devious, detective art, concerned with un-doing what the pictorial image is composed of and weaving into it a web of its own devising" (1989, p. 10). Kaplan asserted, "a wealth of information can be gathered just by discussing the art with the client" (2003, p. 33), thus implying that open discussion might elicit information to supplement the artwork.

Some believe that assessments should be conceptually based, deriving constructs such as attachment theory, development theory, etc. (D. Kaiser, personal communication, May 9, 2004). These individuals assert that focusing on theory is a more strengths-based framework for understanding a client in a way that is systematically and contextually informed: Discerning strengths, developmental position, and attachment security while considering gender, culture, family form, etc., of the client seems more fitting for shaping art therapy interventions for therapeutic change. This school of thought is diametrically opposed to the position held by those who value the medical model and who are tied to the DSM, such as Barry Cohen and Anne Mills (DDS authors) and Linda Gantt and Carmello Tabone (PPAT and FEATS researchers).

McNiff (1998) further anchored the stance that formal assessment methods, whether theoretically or medically based, are ineffective:

Searches for pathology in artistic expression will inevitably lead to futile attempts to categorize the endless variations of expression.
The primary assumptions of art psychopathology theories are unreliable since, as Prinzhorn shows, emotionally troubled people are capable of producing wondrous artworks which compare favorably with the creations of artists and children. These analyses of human differences are incessantly variable whereas the search for more positive and universal outcomes of artistic activity suggests that mentally ill people can use creative expression to transform and overcome the limitations of their conditions. (pp. 99-100)

McNiff believed that categorization of pathological elements in art is a futile direction to pursue. Is there a more acceptable direction to be taken? Can a compromise between the two camps be achieved? In looking to the future, and identifying recommendations of previous researchers, perhaps there are ways to improve and advance this area.

Avenues For Improving Art Therapy Assessments

Practitioners in a variety of mental health professions have developed and used assessment tools, and there is every reason to believe that they will continue to do so. The various issues and problems point to a need for improvement of art therapy assessment and research education, and of methodologies for developing new tools, existing assessments, and rating procedures. It has been suggested that art therapists study the problems thoroughly and learn from previous mistakes (Gantt, 2004). Furthermore, some believe that valid and reliable art-based assessments and rating instruments can and must be developed by art therapists. D. Kaiser (personal communication, May 9, 2004) asserted that art therapists are a pluralistic group and that this will serve them well in the long run, providing that they can end the debate and accept the range of approaches that have evolved. Furthermore, the use of art in assessment is still in its infancy, and even though valid and reliable tools have yet to be developed, this should not prevent such tests from being created (Hardiman, Liu, & Zernich, 1992).

Klopfer and Taulbee recommended that research on projective tests account for whether the variable being investigated is symbolic, conscious, or behavioral, and that until this happens, such investigations will be like comparing walnuts with peaches and coming up with little other than fruit salad (1976, p. 544). It was further suggested that acute and chronic phenomena, state and trait phenomena, and behavioral and symbolic characteristics be distinguished from one another. The authors stressed that behavior should be the focus of personality assessment, since this is usually the reason that a patient is referred for treatment.
Self-concept was also identified as a quality that should be evaluated, since it has an impact on an individual's decision-making process and behavior. Finally, it was suggested that symbolic or private personality traits also be examined, since people are often motivated by their unconscious drives.

Public forums such as conferences and egroups have enabled art therapists to discuss the various problems surrounding assessment, and to identify potential solutions. At the annual national art therapy conferences in recent years, three panel presentations about assessment have been given (Cox, Agell, Cohen, & Gantt, 1998, 1999; Horovitz et al., 2001). The Cox et al. (1998) presentation provided a review of the UPAP, the DDS, and the PPAT and FEATS instruments. Samples of each of these protocols that were completed by patients with specific disorders (psychotic, mood, personality, and cognitive) were shown. This prompted a stimulating discussion between the panelists and the audience about art therapy assessment, and generated interest in a follow-up panel the subsequent year. Thus, in 1999, Cox et al. came together again to address the uniqueness of each procedure more specifically. The goal was to demonstrate when, where, and with whom any one of the three instruments would be most appropriate to use. Slides of UPAP, DDS, and PPAT drawings collected from one patient with schizophrenia, one with major depression, and one with a cognitive disorder, were displayed. The panelists then compared and contrasted the different outcomes of each of the protocols as they pertained to each patient. This provided the audience with a unique opportunity to learn about and discuss the benefits and weaknesses of each assessment directly with the art therapists who designed and/or developed the actual tools.

Other attempts to promote understanding of art therapy assessments and research include an informal survey of assessment use in child art therapy (Mills & Goodwin, 1991); a literature review of projective techniques for children (Neale & Rosal, 1993); and a small-scale weighted analysis of published studies on art therapy assessment research (Hacking, 1999).

Mills and Goodwin (1991) distributed surveys at a national art therapy conference to determine how art therapists use assessments with children. The 37 of 100 questionnaires that were returned revealed that participants were more familiar with projective tools than with art therapy assessments. Most respondents indicated that they preferred instruments that relied on modifications of existing art therapy techniques and projectives, and unpublished assessments.
The authors concluded that there is a vast diversity among art therapists in training and approach to assessments, combined with a keen desire to innovate.

Neale and Rosal (1993) reviewed and evaluated 17 empirical studies on the subject of projective drawing techniques (PDTs) published between the years 1968 and 1991. Studies were grouped by the type of the test used: human figure drawings (HFDs), House-Tree-Person (HTP), kinetic family and school drawings, and idiosyncratic PDTs. HFDs were found to be reliable as a predictor of the performance of learning-related behaviors and as a measure of learning disabilities. The HTP was found to be free of cultural bias. The kinetic family drawings were found to have solid concurrent and test-retest reliability, while the kinetic school drawings had strong concurrent validity when correlated with achievement measures. Idiosyncratic PDTs were found to be the weakest of the tests. The authors identified four research methods that improved the rigor of the studies (p. 47):

(1) The use of objective criteria on which to score variables;
(2) The establishment of interrater reliability;
(3) The collection of data from a large number of subjects;
(4) The duplication of data collection and appropriate analysis procedures to establish effectiveness and reliability of previously studied projective drawing instruments.

It was suggested that adoption of these four methods would help art therapists to improve the quality of assessment research.

In order to overcome some of the more theoretical and philosophical issues with art-based instruments, Gantt's (1986), Burt's (1996), and McNiff's (1998) views may provide some direction. Gantt (1986) examined alternative models for research design and strategies from the fields of anthropology, art history, and linguistics, and suggested that these may have useful implications for art therapy research methods because they concentrate on human behavior and the products of human behavior (i.e., art, culture, language). An American-based doctoral program recently mandated that its research curricula be updated to include the teaching of historical, linguistic, feminist, artistic, and other modes of disciplined inquiry (S. McNiff, personal communication, May 10, 2004). Burt (1996) emphasized qualitative approaches, which she considered to be more closely related to postmodern feminist theory.

Gantt (1986) contended that in order to understand clients more fully, their literary and visual traditions and their cultural rules must be considered in addition to their intra-psychic processes. McNiff (1998) shared a similar view, asserting that research methods that engage both artistic and interpersonal phenomena need to be identified, since the art therapy relationship is a partnership between these elements.
McNiff said that it is important to consider the total context of what a person does, and not to base an evaluation strictly on an interpretation of isolated images (1998, p. 119). Kaplan (2003) also asserted that the client's reactions to engaging in art-making, coupled with interpretation of global features of the art, can produce significant findings.

An alternative to examining artwork through the traditional approaches is the phenomenological device of bracketing, which involves the withholding of judgment when approaching objects of inquiry (McNiff, 1998, p. 118). Rubin (1987) described a similar approach, that of looking at the experience of art therapy, until central themes begin to emerge, with an openness that does not categorize. Phenomenological approaches have been used by Quail and Peavy (1994) and Fenner (1996).

Quail and Peavy (1994) presented a case study of an adult female and described a subjective, experience-as-it-is-lived approach to the art therapy experience. Over the course of a 16-week art therapy group, the subject described the process and her feelings through five unstructured interviews. Quail and Peavy used Colaizzi's (1978) method of extracting significant statements and thereby revealed the subject's progression from preintentional experiencing, to a fully intentional relationship with the object and patterns, to the formation of meaning in the artmaking process.

In Fenner's (1996) study, the client was the researcher, and art therapy was the subject. Over a period of approximately two months, the client engaged in brief drawing experiences of five minutes per sitting in order to determine whether personal meaning would be enhanced and therapeutic change would be achieved. In employing a phenomenological approach, both the Quail and Peavy and Fenner studies offer an alternative method of evaluating clients, one that is different from the traditional, empirical approach.

The use of portfolio review for assessment purposes in art education has useful implications for art therapy (McNiff, 1998). Art educators typically review their students' artworks over the course of the school year, in order to determine skill and assign a grade. The portfolio review in art therapy entails the amassing of artworks created by the client, which allows for tracking of changes in the artwork over time, and for common themes to emerge.
McNiff suggested that the review could be beneficial because it provides for a comprehensive amalgamation of the client's interpretations of his or her own artwork, assessments, and transcripts of sessions. Adapting the art education format of portfolio review and assessment for use in art therapy would enable the therapist to gain a more comprehensive sense of the client's presenting problems, evidence of progress, etc. This method would be most appropriate for use in long-term treatment settings, where clients could amass their art therapy products over the course of time.

Computer Technology and the Advancement of Assessment Methods

According to the National Visual Art Standards, "The art disciplines, their techniques, and their technologies have a strong historic relationship; each continues to shape and inspire the other" (National Art Education Association, 1994, p. 10). It is believed that existing and emerging technologies influence art education due to its dependence on art media (Orr, 2003), and the same is likely true for art therapy. Art therapists are using technology in ways that are likely to advance the area of assessment. The increasing popularity and user friendliness of computer technology is making the digital storage of client artwork more practical for art therapists: "Computer technology will revolutionize possibilities for creative analysis, presentation, and communication of art therapy data" (McNiff, 1998, p. 203).

Art therapist researchers Linda Gantt and Carmello Tabone are developing a website for the PPAT and FEATS (L. Gantt, personal communication, March 10, 2004). The site will provide information about the PPAT assessment and FEATS manual, will make rating sheets and related materials available for downloading, and will enable art therapists to enter data from the PPATs they collect. Only those researchers based in America who demonstrate that they are accumulating the PPATs and adhering to the FEATS scoring instructions will be permitted to enter their data. This will help to ensure the development of a representative sample that could be used for norming purposes.

Analyzing The Research on Art Therapy Assessment Instruments

Another avenue to improve art therapy assessment instruments would be to increase researchers' comprehension of former investigations and provide recommendations for improvements. A synthesis of the existing research, such as the present study, is an effective way to achieve this end. Since there are currently no published systematic analyses or comprehensive literature reviews of research on art therapy assessments or rating instruments, an exploration of reviews and analyses in the related fields of psychology and creative arts therapies is warranted.

Syntheses of psychological assessment research.
Meta-analysis techniques have been used in many studies to synthesize research on psychological assessment (Acton, 1996; Garb, 2000; Garb, Wood, Nezworski, Grove, & Stejskal, 2001; Hiller, Rosenthal, Bornstein, Berry, & Brunell-Neuleib, 1999; Meyer & Archer, 2001; Parker, Hanson, & Hunsley, 1988; Rosenthal, Hiller, Bornstein, Berry, & Brunell-Neuleib, 2001; Spangler, 1992; Srivastava, 2002; West, 1998). Of particular interest is Actons (1996) research involving three studies that were carried out to examine the empirical validity of individual features of human figure drawings as measures of specific forms of psychopathology. This is a model study because unlike several

35

comprehensive and influential past reviews, it grouped findings by construct and employed techniques to determine effect sizes and test their significance. In addition, Actons second study used the results of the previous meta-analysis to develop drawing scales for four specific constructs of psychopathology: anger/hostility, anxiety, social maladjustment, and thought disorder. Finally, the third study in Actons research was a full replication of the second using a new sample of young offenders, and is relevant because the results suggest some potential for aggregates of individual drawing features to provide valid measures of specific forms of psychopathology. Some of the studies identified in the present literature search provide information about the application of specific meta-analytic techniques. For example, in Rosenthal et al.s (2001) study of meta-analytic methods, the Rorschach, and the MMPI, the authors asserted that research synthesists must compute, compare, and evaluate a variety of indices of central tendency, and they must examine the effects of (mediator) variables (p. 449). Other useful elements in this article include commentary on the use of Kappa versus phi, combining correlated effect sizes, and possible hindsight biases. Garb (2000) reanalyzed Wests (1998) meta-analytic data on the use of projective techniques, including the Rorschach test and HFDs, to detect child sexual abuse. West had located 12 studies on detecting sexual abuse and 4 studies on detecting physical abuse, and excluded nonsignificant results. In reanalyzing the data from Wests 12 studies on sexual abuse, Garb calculated new effect sizes using the nonsignificant and significant results. In many of the studies it was found that none of the projective test scores had been well replicated, and that many of those that had reported validity were actually flawed. It was concluded that projective techniques should not be used to detect child sexual abuse. In 2001, Garb et al. 
wrote a commentary about the articles that encompassed the first round of the Special Series on the Rorschach. Viglione (1999) and Stricker and Gold (1999) failed to cite negative findings and praised the Rorschach. Although one of Dawes's (1999) data sets was flawed, he obtained results that provided modest support for the Rorschach. Hiller et al. (1999) reported the results of a meta-analysis, but there were problems, including the fact that their coders were not blind to all of the studies' results. Hunsley and Bailey (1999) found that there is no scientific basis for using the Rorschach, and provided ample support for this conclusion.


Hiller et al. (1999) cited the Atkinson, Quarrington, Alp, and Cyr (1986) and Parker et al. (1988) meta-analytic studies. Average validity coefficients for the Rorschach and the MMPI were found to have similar magnitudes, but methodological problems in both meta-analyses were thought to have impeded acceptance of these results (Garb, Florio, & Grove, 1998). Thus, Hiller et al. conducted a new meta-analysis comparing criterion-related validity evidence for the Rorschach and the MMPI. The unweighted mean validity coefficients (rs) were .30 for the MMPI and .29 for the Rorschach, and they were not reliably different (p = .76 under a fixed-effects model, p = .89 under a random-effects model). The Rorschach had larger validity coefficients than the MMPI for studies using objective criterion variables, whereas the MMPI had larger validity coefficients than the Rorschach for studies using psychiatric diagnoses and self-report measures as criterion variables.

Authors of the final article in the Special Series on The Utility of the Rorschach for Clinical Assessment, Meyer and Archer (2001), provided a summary of this instrument's current status. Global and focused meta-analyses were reviewed, including an expanded analysis of Parker et al.'s (1988) data set. Rorschach, MMPI, and IQ scales were found to have greater validity for some purposes than for others, but all produced roughly similar effect sizes. Eleven salient empirical and theoretical gaps in the Rorschach knowledge base were identified.

Parker, Hanson, and Hunsley (1988) located articles from the Journal of Personality Assessment and the Journal of Clinical Psychology between 1970 and 1981 on the MMPI, the Rorschach Test, and the Wechsler Adult Intelligence Scale (WAIS). The average reliability, stability, and validity of these instruments were estimated. Validity studies based on prior research, theory, or both had greater effects than did studies lacking an empirical or theoretical rationale.
The reliability and stability of all three tests were found to be approximately equivalent and generally acceptable. The convergent-validity estimates for the Rorschach and MMPI were not significantly different, but both of these were lower than was the WAIS estimate. The authors concluded that both the MMPI and Rorschach could be considered to have sufficient psychometric properties if used according to the purpose for which they were designed and validated.

Two meta-analyses of 105 randomly selected empirical research articles on the TAT and questionnaires were conducted by Spangler (1992). Correlations between TAT measures of need for achievement and outcomes were found to be generally positive, and these were quite large for


outcomes such as career success measured in the presence of intrinsic, or task-related, achievement incentives. Questionnaire measures of need for achievement were also found to be positively correlated with outcomes in the presence of external or social achievement incentives. On average, TAT-based correlations were found to be larger than questionnaire-based correlations.

Srivastava (2002) conducted a meta-analysis of all the studies on the Somatic Inkblot Series (SIS-I) published in the Journal of Projective Psychology and Mental Health from 1994 to 2001. The purpose was to provide normative data by combining means and standard deviations of existing studies and to determine whether SIS-I indices could differentiate various groups. For intergroup comparison, critical ratios were computed on the combined means and standard deviations. The comparison groups were in fact significantly differentiated by the SIS-I indices.

In 1998, West meta-analyzed 12 studies to assess the efficacy of projective instruments in discriminating between sexually abused children (CSA) and non-sexually abused children. The Rorschach, the Hand Test, the TAT, the Kinetic Family Drawing, the Human Figure Drawing, Draw Your Favorite Kind of Day, the Rosebush: A Visualization Strategy, and HTP were reviewed. The overall effect size was determined to be d = .81. Six studies included a clinical group of distressed non-sexually abused subjects, and for these the effect size was lower, d = .76. The remaining six studies compared the sexually abused group with a norm group of nonabused children, and the average effect size was d = .87. These effect sizes indicated that projective instruments could effectively discriminate distressed children from those who were non-distressed.
Although most assessment tools seem to be able to differentiate between a normal group of subjects and subjects who experienced sexual abuse during childhood, it could only be inferred that an instrument is able to distinguish nondistress from some type of distress. The inclusion of the clinical group with no history of sexual abuse generated the necessary data to support the assertion that an instrument can discriminate a CSA subject from other types of distressed subjects. The outcomes are in the medium to large range, despite the fact that the inclusion of the clinical group tended to result in a lower effect size for discriminating the CSA subjects. The lower ranges of power that resulted from inclusion of clinical groups were attributed to the fact that symptoms often associated with CSA become evident in clinical disorders.

Syntheses of creative arts therapies assessment research. Several investigators in the creative arts therapies have applied meta-analytic techniques to bodies of assessment research


(Conard, 1992; Hacking, 1999; Hacking, 2001; Loewy, 1995; Oswald, 2003; Ritter & Low, 1996; Scope, 1999; Sharples, 1992). Hacking's work is the most directly relevant: in her analysis of art-based assessment research, she endeavored to identify the central importance of developing systematic, content-free assessments of psychiatric patients' paintings (2001, p. 165). Hacking conducted an analysis of the literature that revealed the best repeatability in order to put this literature on equal footing. She used the resulting data to provide a rationale for developing the Descriptive Assessment for Psychiatric Art (DAPA). Hacking's use of meta-analysis techniques established a foundation for a more comprehensive examination of the art therapy assessment research.

Loewy (1995) found that music therapists were not making use of published music therapy assessment tools, and that the tools were being used to measure music primarily in educational or behavioral terms, rather than to incorporate music as part of a psychodynamic relationship. In an effort to further understand why music therapy assessments were being used in this way, Loewy studied psychotherapeutic practices with emotionally handicapped children and adolescents, and employed a hermeneutic inquiry into a music psychotherapeutic assessment experience. A panel of five prominent music psychotherapists viewed a 50-minute video of a first-session music psychotherapy assessment and developed written assessment reports. Each panelist was interviewed for ninety minutes. The interviews were transcribed, systematized, and analyzed in conjunction with a preliminary analysis of the panel participants' original reports. Loewy then applied meta-analysis techniques to determine the impact that a therapist's musical background, orientation and training, and personal history have on the way that he or she assigns meaning to an initial music therapy experience.
The final analysis yielded five categories of semantic agreement: (1) Affect: Joy and Anxiety; (2) Structure: Time, Basic Beat, and Boundary; (3) Approval-Seeking Behavior; (4) Creativity: Improvisation, Investment/Intent, and Spontaneity; and (5) Symbolic Associations of Instruments. There were four categories of semantic difference: (1) Music; (2) Cognition; (3) Singing in Tune; and (4) Rhythmic Synchrony. Finally, the panel participants' areas of specialization were noted: Theme, Success in the Child, Affect/Congruence, Transference/Countertransference and Horns.

In 1992, Conard conducted a meta-analysis of research that examined the effect of creative dramatics on the acquisition of cognitive skills. The following areas were investigated: (1) the achievement of students involved in creative dramatics as compared to traditional


instructional methods; (2) the impact of sample and study characteristics on outcomes; and (3) the effects of methodology and research on outcomes. Each study was weighted independently, thus accounting for the variety in group size across studies. For studies in which creative dramatics was applied, a mean effect size of 0.48 was calculated. Creative dramatics was found to be more effective at the pre-school and elementary level than at the secondary level. Remedial and regular students appeared to enjoy participating in and benefit from creative dramatics. Studies that were carried out in private schools produced larger effect sizes than those that took place in public schools. The quantitative analysis was combined with qualitative reviews, and the qualitative data enhanced the results of the meta-analysis considerably. However, measurement characteristics such as validity and reliability, and other details of the dependent measures, were frequently excluded in the studies. Conard concluded that future research should include more detailed information about methodology, procedures, and how effects are measured.

Oswald (2003) examined the popularity of meta-analysis in psychological research and presented information about techniques applicable to the arts in education. Instructions were provided to assist the researcher in computing the meta-analytic mean and interpreting it. Guidelines for making statistical artifact corrections in a meta-analysis were discussed. The statistical power of meta-analysis was investigated with respect to detecting true variance across a set of study statistics once corrections have already been made. A set of conceptual issues was presented to address the detection of mediator effects across studies.

Standardized effect sizes for case-control studies of dance/movement therapy (DMT) were calculated in Ritter and Low's (1996) meta-analysis of 23 studies.
Summary statistics reflecting the average change associated with DMT were produced, and the effectiveness of DMT in different samples and for varying diagnoses was examined. It was determined that the methodological problems identified in the DMT research could be addressed with the use of standardized measures and inclusion of adequate control groups. DMT was found to be an effective treatment for a variety of patients, especially those coping with anxiety. It was further concluded that adults and adolescents benefit from DMT more than do children. In 1999, Scope conducted a meta-analysis of the literature on creativity to examine the effects of instructional variables (such as time spent on instruction, reviewing previous lessons, etc.) on increases in creativity in school-aged children. All accessible studies, including published and non-published, were located. The subjects ranged from preschoolers to high


school students. Instruction was found to have a positive effect on the children's creativity, and there was a modest positive correlation between creativity and independent practice. However, the instructional variables of time spent on instruction, structuring, reviewing, questioning, and responding were not found to have an impact on creativity. Additional variables or combinations thereof might have caused the increases in creativity. An exploratory, qualitative review of three exceptional studies revealed that the most successful treatments were motivating for the subjects, were developmentally appropriate, had high treatment compliance, and had high levels of teacher-student interactions.

Sharples (1992) reviewed 27 experimental studies published between the years 1970 and 1989 from the fields of art education and psychology, and conducted a qualitative meta-analysis in order to investigate the relationship between social constraints, intrinsic motivation, and creative performance.

Conclusion

The foundations of art therapy assessment instruments provided information about historical milestones in psychological evaluation that influenced the development of art therapy tools. The review of the literature on some of the first widely used tools indicated that the use of projective techniques and art-based instruments is questionable due primarily to problems of measurement validity and reliability. The section pertaining to the development of art therapy assessments summarized the influence of the first instruments designed by art therapists, and illuminated the significance of formal rating systems. Literature emphasizing the importance of art therapy assessments reflected the use of these tools in different settings to plan intervention or treatment and to evaluate results; the derivation of meaningful information about clients; and the importance of assessment in advancing the field of art therapy.
Issues with art therapy assessments were illustrated with citations from the literature on practical problems related to validity and reliability, as well as philosophical and theoretical concerns. These pointed to questions about whether and how art therapy assessments could be improved. There was some agreement in the literature that the quest for valid and reliable instruments should be pursued, and several suggestions for improvement were put forth. The use of computer technology and the application of meta-analysis techniques to examine the literature were suggested as possible avenues.


The information derived from analyses in the fields of psychology and the creative arts therapies was helpful in formulating ideas and determining appropriate methods and procedures for A Systematic Analysis of Art Therapy Assessment and Rating Instrument Literature. The next chapter details the methodology for the present study.


CHAPTER 3

METHODOLOGY

Approach for Conducting the Systematic Analysis

The purpose of this chapter is to discuss the methods that were used to address the research questions of the present study and the issues that were encountered. The overriding question of the present study was: what does the literature tell us about the current state of art therapy assessments? The three sub-questions were: (1) To what extent are art therapy assessments and rating instruments valid and reliable for use with clients?; (2) To what extent do art therapy assessments measure the process of change that a client may experience in therapy?; and (3) Do objective assessment methods such as standardized art therapy tools give us enough information about clients? The following stages were used as a guideline in the present study: problem formulation; the literature search stage; extracting data and coding study characteristics; data evaluation; and data analysis and interpretation (Cooper, 1998).

Problem Formulation

Problem formulation was considered before the present study was conducted. This involved selection of a topic and later specification of inclusion criteria: "The literature search can begin with a conceptual definition and a few known operations that measure it. Then, as the synthesist becomes more familiar with the research, the concept and associated operations can grow more precise" (Cooper, 1998, p. 14). In order to identify the research problem, it was necessary to gather a collection of studies at the outset (Cooper, 1998). Studies that were later found to be irrelevant were eliminated. This involved decision-making about which criteria to include in the analysis.

Additional factors in literature selection were considered during the problem formulation stage (Cooper, 1998). These were: conceptual relevance, study-generated evidence, synthesis-generated evidence, and the possible existence of previous meta-analyses.


Because studies must be judged to be conceptually relevant (Cooper, 1998), this researcher adhered to four primary factors that have been shown to influence the quality of such judgments: the researcher remained open-minded throughout the problem formulation stage (per Davidson, 1977); possessed expertise on the topic and consulted with other experts; based decisions on abstracts (per Cooper & Ribble, 1989); and dedicated a considerable amount of time (several months) to this process (per Cuadra & Katter, 1967), thereby expending every effort to ensure quality in judgments about conceptual relevance of primary studies.

Study-generated and synthesis-generated refer to sources of evidence about relationships within research syntheses. These were considered during problem formulation. Study-generated evidence is present when a single study contains results that directly test the relation being considered. Synthesis-generated evidence is present when the results of studies using different procedures to test the same hypothesis are compared to one another (i.e., meta-analysis) (Cooper, 1998, p. 22). When using synthesis-generated evidence to study descriptive statistics or bivariate relationships (and the corresponding interactional hypothesis), this researcher was alert to Cooper's caution that social scientists often use different scales to measure variables, which makes it difficult to test bivariate relationships or to aggregate descriptive statistics.

When a subject has been widely researched, it is likely that previous meta-analyses on that topic already exist (Cooper, 1998). Syntheses conducted in the past can provide a basis for creating a new one. Accordingly, this author located one previous meta-analysis on the present topic: Hacking's (1999) analysis of art-based assessment research. Hacking conducted a meta-analysis of drawing elements (such as color, line, and space).
For the present study, the data available on individual drawing elements were too limited to warrant the application of meta-analysis techniques. However, Hacking did not conduct a meta-analysis of concurrent validity and inter-rater reliability statistics, items that were determined to be important in addressing this author's research questions.

Criteria for selection of art therapy assessment studies. Between February of 2004 and February of 2005, unpublished and published sources were located for inclusion in the present study. The following criteria were used for the selection of primary studies:

• Papers from the field of art therapy only (i.e., no psychological art-based assessments, such as the Draw A Person, House-Tree-Person, etc.).
• Assessments that involve drawing only (i.e., any tests that involve response to an external stimulus were excluded).
• Assessments that require the subject to complete no more than three drawings (i.e., tools that encompass a battery of more than three drawings were excluded).
• Features of art rating scales that measure formal elements (as opposed to picture content, thus content checklists were excluded).
• Studies written in English; and studies conducted within the last 32 years (since 1973).
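As a minimal sketch only, and not a procedure reported in the present study, the selection criteria listed above can be read as a conjunction of simple checks applied to each candidate study. The record type `CandidateStudy`, its field names, and the function `meets_criteria` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateStudy:
    # Hypothetical fields, one per selection criterion listed above.
    field_of_origin: str         # e.g. "art therapy" or "psychology"
    drawing_only: bool           # False if the task responds to an external stimulus
    drawings_required: int       # number of drawings the subject must complete
    rates_formal_elements: bool  # False for content checklists
    language: str
    year: int


def meets_criteria(study: CandidateStudy, earliest_year: int = 1973) -> bool:
    """Return True only if every selection criterion is satisfied."""
    return (
        study.field_of_origin == "art therapy"
        and study.drawing_only
        and study.drawings_required <= 3
        and study.rates_formal_elements
        and study.language == "English"
        and study.year >= earliest_year
    )


# Invented example records, not studies from the present analysis.
included = CandidateStudy("art therapy", True, 3, True, "English", 1995)
excluded = CandidateStudy("art therapy", True, 5, True, "English", 1995)
```

Under this reading, a study failing any single check (here, a battery of five drawings) is excluded regardless of how well it satisfies the others.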

These criteria were followed upon initiation of the literature search stage.

The Literature Search Stage

For this stage, every effort was made to locate and retrieve published and unpublished studies on art therapy assessments and rating instruments, in an effort to ensure that the cumulative outcome of the present study would reflect the results of all previous research (per Cooper, 1998). To protect against threats to validity in the literature review stage, Cooper's (1998) guidelines were used: (a) conduct a broad and exhaustive literature search, (b) provide a detailed description in the research paper about how studies were gathered, (c) present indices of potential retrieval bias (if available), such as an examination of whether any difference exists in the results of published versus unpublished studies, and (d) summarize the sample characteristics of individuals used in separate studies.

Locating studies. As recommended by Cooper (1998), three methods for locating primary studies were employed for the present study: informal channels, formal methods, and secondary channels. Specifically, four principal types of informal channels were used: (a) personal contacts, (b) solicitation letters (sent to colleagues via email), (c) electronic invisible colleges (i.e., networks of arts therapists who share information with each other via internet listservs), and (d) the World Wide Web. Formal techniques for locating studies were also used: (a) professional conference paper presentations, (b) personal journal libraries (i.e., the author's personal collection: The American Journal of Art Therapy; Art Therapy, Journal of the American Art Therapy Association; The Arts in Psychotherapy), and (c) research report reference lists (i.e., reviews of research reports already acquired) (Cooper, 1998). Awareness was maintained of the potential for peer review and


publication biases when using personal journal libraries. Specifically, "The scientific rigor of the research is not the sole criterion for whether or not a study is published. Most notably, published research is biased toward statistically significant findings" (Cooper, 1998, p. 54). In addition to this prejudice against the null hypothesis, another source of bias that impacts publication was considered: collectively confirmatory biases (Nunnally, 1960). Specifically, findings that conflict with the prevailing beliefs of the day are less likely to be submitted for publication, and are less likely to be selected for publication, than research which substantiates currently held beliefs.

The secondary channels for locating studies employed in the present study included bibliographies and reference databases. The bibliographies consisted of those previously prepared by others (such as an unpublished handout of the DDS bibliography). The reference databases used to locate studies by this author were Cambridge Scientific Abstracts and PsycInfo.

Thirty-nine studies were initially located. Of these, four were excluded: Bowyer (1995) (could not be mailed); Mills, Cohen and Meneses (1993a & 1993b) (a review of previous studies); Rankin (1994) (a review of previous studies); and Teneycke (1998) (could not be located). The remaining 35 studies were retained for inclusion in the final analysis (Appendix J).

Extracting Data and Coding Study Characteristics

Once a satisfactory number of primary studies had been found, the synthesist perused each of these. The next step was to design a coding sheet.

The research synthesis coding sheet. A coding sheet is used by the synthesist to systematize information from the primary research reports (Cooper, 1998). Abiding by the procedures set forth by Cooper, this researcher began constructing a coding sheet by predetermining some of the data that would be extracted, and then put together a draft coding sheet.
The coding sheet was further solidified after the studies were read. When constructing the coding sheet, all potentially relevant information was retrieved from the studies. Stock's (1994) criteria for coding sheet design were also followed. Specifically, six general categories were incorporated: report identification, citation information, research design, site variables, participant variables, and statistical information. A category for noting study quality was also included, and space was made available on the sheet for descriptive notes. Please refer to Appendix I for the coding sheet that was used to code the primary studies.
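For readers who keep coded data electronically, the six general categories described above could be represented roughly as follows. This is an illustrative sketch under stated assumptions: the class `CodingSheet`, its methods, and the item names are hypothetical and do not reproduce the actual sheet in Appendix I.

```python
from dataclasses import dataclass, field

# The six general categories named in the text (after Stock, 1994).
CATEGORIES = (
    "report_identification",
    "citation_information",
    "research_design",
    "site_variables",
    "participant_variables",
    "statistical_information",
)


@dataclass
class CodingSheet:
    # One dictionary of coded items per category (hypothetical layout).
    entries: dict = field(default_factory=lambda: {c: {} for c in CATEGORIES})
    study_quality: str = ""  # separate category for noting study quality
    notes: str = ""          # space for descriptive notes

    def record(self, category: str, item: str, value) -> None:
        """Store one coded item under one of the six categories."""
        if category not in CATEGORIES:
            raise KeyError(f"unknown category: {category}")
        self.entries[category][item] = value

    def incomplete_categories(self) -> list:
        """List categories for which nothing has been coded yet."""
        return [c for c in CATEGORIES if not self.entries[c]]


sheet = CodingSheet()
sheet.record("citation_information", "year", 1994)  # invented example value
```

A helper such as `incomplete_categories` makes it easy to see which parts of a sheet still need to be filled in before coders' sheets are compared.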


Categorizing research methods. "The synthesist must decide what methodological characteristics of studies need to be coded" (Cooper, 1998, p. 84). While the synthesist should code all potentially relevant, objective aspects of research design (p. 88), there are threats to validity that may not be captured by this information alone. As such, the mixed-criteria approach, known as the optimal strategy for categorizing studies, was employed for the present analysis. This approach is a combination of two a posteriori methods: (a) the threats-to-validity approach, wherein judgments must be made about the threats to validity that exist in a study; and (b) the methods-description approach, wherein the objective design characteristics of a study, as described by the primary researcher, must be detailed.

As Cooper (1998) suggested, any limits on the types of individuals sampled in the primary studies were recorded, along with information about where and when the studies were conducted. In addition, it was noted when the dependent variable measurements were taken in relation to measurement or manipulation of the independent variables. In order to assess the statistical power of each study, the following items were recorded: (a) the number of participants; (b) the number of other factors (sources of variance) extracted by the analyses; and (c) the statistical test used (Cooper, 1998).

Testing of the coding sheet. In order to ensure reliable codings, per Cooper's (1998) recommendations, this synthesist adhered to the rules for developing a thorough and exhaustive coding sheet, described previously in this chapter. The researcher met with the coders to discuss the process in detail and to review practice examples. An initial draft of the coding sheet was created by this author, and then feedback was solicited from colleagues. Following this step, three studies were selected randomly (Couch, 1994; Kress & Mills, 1992; Wilson, 2004) in order to pilot-test the sheet.
Three people served as coders for the pilot test. This researcher served as a coder (Coder 1), and trained two individuals to code the three studies: Coder 2 was an art therapist and doctoral student in the FSU Art Education department who was blind to the study, and Coder 3 was the author's major professor, who was not blind to the study. The coders met with the researcher to determine the rate of inter-coder agreement. This was calculated to be 73%, and it was determined that the pilot coding sheet needed to be revised. In revising the pilot coding sheet, additional categories were incorporated and category descriptors were defined with more precision, as suggested by Cooper (1998). A second and final test of the revised coding sheet was conducted. The coders were provided with three new studies


that were randomly selected from the pool of 35 and were trained by the researcher in how to use the revised coding sheet. The primary studies coded for this second test were: Batza (1995), Cohen & Heijtmajer (1995), and McHugh (1997). Inter-coder agreement among these coding sheets was 100%, thus the coding sheet was deemed appropriate for use in the present study. As recommended by Cooper (1998), coder reliability was checked on randomly chosen studies once coding was in progress. Coders 1 and 2 coded three papers: Francis, Kaiser & Deaver (2003), Gulbro-Leavitt & Schimmel (1991), and Neale (1994). Inter-coder agreement was calculated to be 100%.

Data Evaluation

During this stage, the author critically assessed data quality: it was determined whether the data were contaminated by factors that were irrelevant to the central problem (Cooper, 1998). Specifically, decisions were made about whether to keep separate or to aggregate multiple data points (correlations or effect sizes) from the same sample, which involves independence and nonindependence of data points. In addition, the analyst looked for errors in recording, extreme values, and other indicators that suggest unreliable measurements. The size of relationships or treatment effects was also investigated.

Criteria for judging the adequacy of data-gathering procedures were established in order to determine the trustworthiness of individual data points (per Cooper, 1998). Then, any primary studies found to be invalid or irrelevant to the synthesis were either discarded (a discrete decision), or weighted differently depending on their relative degree of trustworthiness (a continuous decision) (p. 79).

Identifying independent comparisons. As delineated by Cooper (1998), each statistical test was coded as a discrete event: studies included in the final analyses (described later) had two main statistical comparisons, concurrent validity and inter-rater reliability.
Thus, two separate Microsoft Excel data sheets were made for each pertinent study. This approach, known as the shifting unit of analysis, ensures that for analyses of influences on relationship strengths or comparisons, a single study can contribute one data point to each of the categories distinguished by the (mediating) variable (p. 100). A shifting unit of analysis is recommended as an effective approach, because, although it can be confusing, it allows studies to retain their maximum information value while keeping to a minimum any violation of the assumption of independence of statistical tests (p. 100).
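The logic of the shifting unit of analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the spreadsheet procedure used in the present study: the function `one_point_per_unit`, the row layout, and all numeric values are invented for the example, and simple averaging is only one of several ways to reduce multiple results to a single data point.

```python
from collections import defaultdict
from statistics import mean


def one_point_per_unit(rows, key):
    """Average the coefficients that share a unit, so each unit yields one data point."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row["r"])
    return {unit: mean(values) for unit, values in groups.items()}


# Invented coefficients: study "A" reports three tests, study "B" reports one.
rows = [
    {"study": "A", "comparison": "inter-rater reliability", "r": 0.80},
    {"study": "A", "comparison": "inter-rater reliability", "r": 0.90},
    {"study": "A", "comparison": "concurrent validity", "r": 0.40},
    {"study": "B", "comparison": "inter-rater reliability", "r": 0.70},
]

# Unit = study: each study contributes a single data point.
per_study = one_point_per_unit(rows, key=lambda row: row["study"])

# Unit shifts to study-within-category: each study contributes one
# data point to each category of the variable being examined.
per_category = one_point_per_unit(
    rows, key=lambda row: (row["study"], row["comparison"])
)
```

Here study A supplies one value when the unit is the study, but two values (one for reliability, one for validity) when the unit shifts to the comparison type, which is how a single study retains its information without being counted twice within any one category.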


Data Analysis and Interpretation

The following data analysis steps were followed and are reported in Chapter Four: (a) simple description of study findings (Glass et al., 1981); (b) correlating study characteristics and findings; (c) calculating mean correlations, variability, and correcting for artifacts (Arthur et al., 2001); (d) deciding to search for mediators (e.g., subject variables); (e) selecting and testing for potential mediators; (f) linear analysis of variance models for estimation (Glass et al., 1981); and (g) integrating studies that have quantitative independent variables.

Statistical procedures were used to interpret the data, so that systematic patterns could be distinguished from chance fluctuations (Cooper, 1998). Study outcomes were expressed by means of a correlation coefficient and an effect size (per Glass et al., 1981). Then, methods of tabulating and describing statistics were applied: averages, frequency distributions, measures of variability, and so on. These data are reported in Chapter Four of the present study.

The test-criterion relationship was expressed as an effect size for each study included in the analysis (AERA, 1999). Since the strength of this relationship was found to vary according to mediator variables (such as the year in which data were collected, whether studies were published or not, etc.), separate estimated effect size distributions for subsets of studies were computed, and magnitudes of the influences of situational features on effect sizes were estimated.

Cooper's (1998) stages served as a model for determining the methods and procedures that were necessary to conduct the systematic analysis of art therapy assessment research and manage the issues that were encountered. The next chapter reports the present study's results.
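To make step (c) above concrete, the following sketch computes a sample-size-weighted mean correlation and the weighted variance of the observed correlations. It is one common bare-bones formulation and is offered only as an assumption about how such quantities may be obtained; this chapter does not specify the exact formulas used, and the function name `weighted_mean_r` and the inputs `rs` (correlations) and `ns` (sample sizes) are hypothetical.

```python
def weighted_mean_r(rs, ns):
    """Return (mean r, observed variance of r), both weighted by sample size."""
    if len(rs) != len(ns) or not rs:
        raise ValueError("rs and ns must be non-empty and of equal length")
    total_n = sum(ns)
    # Larger samples pull the mean toward their correlation.
    r_bar = sum(n * r for r, n in zip(rs, ns)) / total_n
    # Weighted spread of the observed correlations around that mean.
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    return r_bar, var_obs


# Invented values: two studies with r = .50 (n = 10) and r = .70 (n = 30).
r_bar, var_obs = weighted_mean_r([0.50, 0.70], [10, 30])
```

With these invented inputs the weighted mean is (10 × .50 + 30 × .70) / 40 = .65, closer to the larger study's value than the unweighted mean of .60 would be.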


CHAPTER 4

RESULTS

The primary aim of the present study is to uncover information about the current state of art therapy assessments and rating instruments, especially with respect to validity and reliability. Particular methods were applied in order to address this study's research questions, as was outlined in Chapter Three. Specifically, the researcher gathered descriptive data and computed synthesis outcomes in an effort to reveal (a) to what extent art therapy assessments and rating instruments are valid and reliable for use with clients; (b) to what extent art therapy assessments measure the process of change that a client may experience in therapy; and (c) whether objective assessment methods such as standardized art therapy tools give us enough information about clients.

Thus, in this chapter, descriptive results on the 35 primary research papers selected for the present study are described (please refer to Appendix J for a bibliography of the 35 papers and the method of study location). Detailed information about the inter-rater reliability of the primary studies is provided, including meta-analysis results and an examination of potential mediating variables. Similarly, concurrent validity data derived from some of the primary studies are examined, consisting of meta-analysis results and an examination of potential mediators.

Descriptive Results

In this section, the following results are described: citation dates; citation types; art therapy assessment types; patient group categories; and rater tallies.

Citation dates. Out of 35 total studies used in the analysis, 16 (45.71%) were classified as published and 19 (54.29%) were unpublished. Figure 1 illustrates frequency data on the citation dates for each of the 35 studies.


Figure 1. Citation Date Histogram (citation dates from 1973 to 2004 plotted against frequency)

Citation types. Six types of citations are used in this study. The frequencies are shown in Figure 2, and are tallied as follows: 15 journal articles (42.86% of total studies); nine master's theses (25.71%); five dissertations (14.28%); four unpublished papers (11.43%); one bachelor's thesis (2.86%); and one book chapter (2.86%).


Figure 2. Citation Type

Citation Type        Frequency Total
Journal article      15
Master's thesis      9
Dissertation         5
Unpublished paper    4
Bachelor's thesis    1
Book chapter         1

Art therapy assessment types. Eight categories of Art Therapy Assessment Type were tabulated in this study. The frequencies are: 20 DDS studies (57.14%); four Bird's Nest Drawing (BND) studies (11.43%); four PPAT studies (11.43%); two studies that were not applicable for this category as they employed rating scales only (5.71%); two studies that analyzed spontaneous art (5.71%); one that used the A Favorite Kind of Day (AFKOD) tool (2.86%); one that employed the Bridge Drawing (2.86%); and one that used the DAPA (2.86%). Figure 3 displays these results.

Figure 3. Art Therapy Assessment/Rating Instrument Type

Art Therapy Assessment Type    Frequency Total
DDS/CDDS                       20
BND                            4
PPAT                           4
N/A                            2
Spontaneous art                2
A Favorite KOD                 1
Bridge Drawing                 1
DAPA                           1


The majority of papers in the present study examined the DDS (57.14%), thus a substantial amount of information about this tool was gathered. For example, seventeen studies provided numerical data on specific DDS Drawing Analysis Form (DAF) variables, which may be of interest to the reader (please see Appendix K). Patient group categories. Fifty-eight total patient groups were identified in the 35 studies analyzed by this author. They are classified as follows: Major Mental Illnesses (Bipolar Disorders, Depressive Disorders, Dual/Multiple Diagnoses, Dissociative Disorders, Schizophrenia and Other Psychotic Disorders) (27 studies, 77.14% of total); Mental Retardation (two, or 5.71%); Disorders of Childhood (Adjustment Disorders, Attachment, Communication Disorders, Conduct Disorders, Problems Related to Abuse or Neglect, SED) (12, or 34.28%); Eating Disorders, Personality Disorders and Substance-Related Disorders (nine, or 25.71%); Brain Injury and Organic Mental Syndromes and Disorders (three, or 8.57%), and Unspecified Diagnosis/Miscellaneous and Normal (five, or 14.28% out of 35 total studies). Table 2 illustrates the tally of these groups.

Table 2. Patient Group Categories

Patient Group Category: Major Mental Illnesses
Citations: Batza 1995, Brudenell 1989, Coffey 1997, Cohen 1988, Cohen 1995, Easterling 2000, Fowler 2002, Gantt 1990, Gulbro 1988, Gulbro 1991, Hacking 1996, Hacking 2000, Johnson 2004, Kress 1992, Mchugh 1997, Mills 1993, Ricca 1992, Shlagman 1996, Wadlington 1973
Frequency Total: 27

Patient Group Category: Mental Retardation
Citations: Batza 1995, Gantt 1990
Frequency Total: 2

Patient Group Category: Disorders of Childhood
Citations: Coffey 1997, Francis 2003, Hyler 2002, Kaiser 1993, Manning 1987, Neale 1994, Overbeck 2002, Shlagman 1996, Wadlington 1973, Wilson 2004, Yahnke 2000
Frequency Total: 12

Patient Group Category: Eating Disorders; Personality Disorders; Substance-Related Disorders
Citations: Bergland 1993, Billingsley 1998, Eitel 2004, Hacking 1996, Hacking 2000, Kessler 1994, Mills 1989
Frequency Total: 9

Patient Group Category: Brain Injury; Organic Mental Syndromes and Disorders
Citations: Couch 1994, Gantt 1990, Hacking 1996
Frequency Total: 3

Patient Group Category: Unspecified Diagnosis/Miscellaneous, Normal
Citations: Eitel 2004, Gussak 2004, Hays 1981, Neale 1994, Wadlington 1973
Frequency Total: 5


Nine studies had two or more patient group types: Batza 1995 (two groups), Coffey 1997 (two groups), Eitel 2004 (two groups), Gantt 1990 (three groups), Hacking 1996 (three groups), Hacking 2000 (two groups), Neale 1994 (two groups), Shlagman 1996 (two groups) and Wadlington 1973 (three groups).

Rater tallies. As is shown in Table 3, the most common arrangement was three people rating artwork (12 studies, or 34.28% of the 35 total studies); six (or 17.14%) studies used one rater; five studies used two raters (or 14.28%); two (or 5.71%) studies used four raters; another two (5.71%) used seven raters; one (2.86%) used six raters; another study (or 2.86%) used five raters; one (or 2.86%) used 86 raters; and for five (14.28%) studies the number of raters could not be determined.

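Table 3. Number of Raters

Number of Raters: 1
Citations: Coffey 1997, Cohen 1988, Gussak 2004, Kessler 1994, Kress 1992, Mills 1989
Frequency Total: 6

Number of Raters: 2
Citations: Couch 1994, Mchugh 1997, Shlagman 1996, Wilson 2004, Yahnke 2000
Frequency Total: 5

Number of Raters: 3
Citations: Billingsley 1998, Brudenell 1989, Fowler 2002, Francis 2003, Gantt 1990, Hyler 2002, Johnson 2004, Kaiser 1993, Manning 1987, Neale 1994, Overbeck 2002, Ricca 1992
Frequency Total: 12

Number of Raters: 4
Citations: Gulbro 1988, Wadlington 1973
Frequency Total: 2

Number of Raters: 5
Citations: Cohen 1995
Frequency Total: 1

Number of Raters: 6
Citations: Bergland 1993
Frequency Total: 1

Number of Raters: 7
Citations: Hacking 1996, Hacking 2000
Frequency Total: 2

Number of Raters: 86
Citations: Eitel 2004
Frequency Total: 1

Number of Raters: Can't Determine
Citations: Batza 1995, Easterling 2000, Gulbro 1991, Hays 1981, Mills 1989
Frequency Total: 5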

Inter-Rater Reliability

Nineteen of the 35 studies computed inter-rater reliability to determine the proportion of agreement among the individuals who rated assessment drawings (Table 4). Eitel (2004) was excluded from the inter-rater reliability analysis because the procedures used in that study were unclear. A variety of statistical measures for inter-rater reliability were reported in the remaining 18 studies. These included kappa, percentage, and correlation (intra-class correlation, r, rho, or alpha). The type of statistic and numerical result for each study are provided. The kappa statistic was used in seven studies, five studies reported percentages for inter-rater reliability, and correlation was used on nine occasions. Three studies that employed more than one method of calculating inter-rater reliability (Fowler 2002, Johnson 2004, and Mchugh 1997) are shown twice in the table.

Table 4. Inter-Rater Reliability Citation Bergland 1993 Billingsley 1998 Fowler 2002 Fowler 2002 Francis 2003 Gantt 1990 Gulbro 1988 Hacking 1996 Hacking 2000 Hyler 2002 Johnson 2004 Johnson 2004 Kaiser 1993 Manning 1987 Mchugh 1997 Mchugh 1997 Neale 1994 Overbeck 2002 Shlagman 1996 Wilson 2004 Yahnke 2000 Global Effect Size 0.607 0.625945 0.89999324 0.90491 0.75 0.75 0.82 0.853 0.34 0.9 0.805 0.944 0.91 0.975 0.74 0.95 0.97 0.992 0.66 0.79 0.567 0.842 0.9 Kappa Percentage Correlation* 0.91

* Intra-class correlation, Pearson's r, Rho, or Alpha


Meta-analysis of inter-rater reliability. A global effect size was computed for each of the three inter-rater reliability categories (kappa, percentage and correlation). In meta-analysis, global effect sizes for kappas and percentages of agreement are computed via the weighted average method. For correlations (r), Fisher's Z is used to transform r so that a global effect size can be obtained.

For the studies that used kappa, a weighted average was calculated by using the sample size for each study. Specifically, each kappa was multiplied by its study's sample size, the products were summed, and the sum was divided by the total of the sample sizes, resulting in a global effect size of 0.63 (Table 5). The non-weighted average of all kappas was 0.64.

For the studies that reported percentage agreement, the non-weighted average of the percentages of all studies was 0.89 (Table 6). The weighted average percentage of all studies was computed as follows: sum (percentage of each study multiplied by corresponding sample size) divided by sum (sample size of each study), resulting in a global effect size of 0.9.

Whereas weighted averages were computed to obtain global effect sizes for the kappa and percentage agreement statistics, Fisher's Z was used to compute the global effect size for inter-rater reliability correlations. Nine studies qualified for the meta-analysis of inter-rater reliability results (Bergland 1993, Gantt 1990, Hacking 1996, Hacking 2000, Johnson 2004, Kaiser 1993, Manning 1987, Mchugh 1997, Wilson 2004). The global effect size (Fisher's Z) for these studies was 0.91 (Appendix L). A sensitivity analysis was conducted in which two small studies were excluded (Hacking 1996, N=8; Hacking 2000, N=8). Meta-analysis results are attainable even with an N as small as seven (Q. Wang, personal communication, March 18, 2005). Thus, the global effect size of the remaining seven studies was calculated, and 0.9 was the resulting figure (Appendix M).
This may indicate that inter-rater reliability in art therapy assessment research is higher when correlations are used. The global effect size was calculated in order to determine the degree to which the null hypothesis was false. The number was computed to be very high, 0.91, revealing that there is a great deal of variability among primary study results. To determine the cause of this variability (heterogeneity), potential mediator variables were examined: rater trained vs. not trained; primary author served as rater vs. did not serve as a rater; coder found study supported test vs. coder neutral on whether study supported test; study published vs. study not published. The


original nine studies were retained for the examination of potential mediators (i.e., sources of variability).

Table 5. Kappa Effect Size

Citation         Kappa    N (Study Sample Size)    Kappa*N
Fowler 2002      0.567    48                       27.216
Francis 2003     0.66     70                       46.2
Hyler 2002       0.805    49                       39.445
Mchugh 1997      0.34     80                       27.2
Neale 1994       0.75     90                       67.5
Overbeck 2002    0.75     32                       24
Yahnke 2000      0.607    31                       18.817
TOTALS                    400                      250.378

GLOBAL EFFECT SIZE: 0.63985714 (non-weighted); 0.625945 (weighted)

Table 6. Percentage Effect Size

Citation            Percentage    N (Study Sample Size)    Percentage*N
Billingsley 1998    0.9           27                       24.3
Fowler 2002         0.842         48                       40.416
Gulbro 1988         0.95          83                       78.85
Johnson 2004        0.944         78                       73.632
Shlagman 1996       0.82          60                       49.2
TOTALS                            296                      266.398

GLOBAL EFFECT SIZE: 0.8912 (non-weighted); 0.89999324 (weighted)
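The weighted and non-weighted averages reported in Tables 5 and 6 can be checked with a few lines of code. The following is a minimal sketch in Python, not the procedure actually used for this study; the function and variable names are illustrative assumptions, while the kappa, percentage, and sample size values are taken from the two tables.

```python
# Minimal sketch: sample-size-weighted and non-weighted averages of
# inter-rater reliability statistics (values from Tables 5 and 6).
# Function and variable names are illustrative assumptions.

# (statistic, study sample size) for each study reporting kappa (Table 5)
kappa_studies = {
    "Fowler 2002": (0.567, 48),
    "Francis 2003": (0.66, 70),
    "Hyler 2002": (0.805, 49),
    "Mchugh 1997": (0.34, 80),
    "Neale 1994": (0.75, 90),
    "Overbeck 2002": (0.75, 32),
    "Yahnke 2000": (0.607, 31),
}

# (statistic, study sample size) for each study reporting percentage agreement (Table 6)
percentage_studies = {
    "Billingsley 1998": (0.9, 27),
    "Fowler 2002": (0.842, 48),
    "Gulbro 1988": (0.95, 83),
    "Johnson 2004": (0.944, 78),
    "Shlagman 1996": (0.82, 60),
}


def weighted_mean(pairs):
    """Sum of (statistic * N) divided by the sum of N."""
    pairs = list(pairs)
    total_n = sum(n for _, n in pairs)
    return sum(value * n for value, n in pairs) / total_n


def unweighted_mean(pairs):
    """Simple arithmetic mean of the statistics, ignoring sample size."""
    pairs = list(pairs)
    return sum(value for value, _ in pairs) / len(pairs)


kappa_weighted = weighted_mean(kappa_studies.values())
kappa_unweighted = unweighted_mean(kappa_studies.values())
pct_weighted = weighted_mean(percentage_studies.values())
pct_unweighted = unweighted_mean(percentage_studies.values())

print(f"Kappa: weighted {kappa_weighted:.6f}, non-weighted {kappa_unweighted:.6f}")
print(f"Percentage: weighted {pct_weighted:.6f}, non-weighted {pct_unweighted:.6f}")
```

Weighting by sample size gives larger studies more influence on the global effect size, which is why Mchugh 1997 (kappa of 0.34, N of 80) pulls the weighted kappa slightly below the non-weighted figure.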

Examination of potential mediators. Among the five studies for which all raters were not trained, the global effect size (correlation) was 0.89 (Appendix N). In the four studies for which raters were trained, the global effect size (correlation) was higher: 0.96, which may indicate that training raters results in higher reliability. Variation among reliabilities of the nine studies was 42.74, and the Q-between (used to test whether the average effects from the groupings are homogeneous [Cooper, 1998]) of 12.94 is fairly low, which indicates that this variable does not really help to explain the heterogeneity among the nine studies. Furthermore, the Q-within (used to compare groups of R indexes) indicates that there is 29.8 variability that cannot be explained by whether raters were trained or not.

In six papers, the primary study's author did not serve as a rater. Among these, the global effect size (correlation) was 0.9 (Appendix O). Authors did serve as raters in the other three studies, and the global effect size (correlation) for these was higher, at 0.93, which may suggest that using authors as raters results in higher inter-rater reliability. Variation among the reliabilities of the nine studies was 42.74, and the low Q-between of 2.00 suggests that this variable does not help to explain the heterogeneity among the nine studies. Furthermore, the Q-within indicates that there is 40.74 variability that cannot be explained by whether primary authors served as raters or not.

Among the four papers for which the coder (this author) found that the study supported the art-based test, the global effect size (correlation) was high at 0.94 (Appendix P). In the five papers for which the coder was neutral on whether a given study supported the art-based test, the global effect size (correlation) was lower, at 0.86. These findings may indicate that the coder's opinion about whether a study's results supported a test was slightly more reliable than whether the coder believed that a study's findings were neutral. Variation among reliabilities of the nine studies was 42.74, and the Q-between of 14.31 is fairly low, which indicates that this variable does not help to explain the heterogeneity among the nine studies. Furthermore, the Q-within indicates that there is 28.43 variability that cannot be explained by the Coder Favor variable.

Five of the nine studies were not published. Among these, the global effect size (correlation) was 0.91 (Appendix Q). The remaining four studies were classified as published, and the global effect size (correlation) for these was lower, at 0.9. This may suggest that whether a study was published or not, it showed an average reliability of 0.9. Variation among the nine studies' reliabilities was 42.74, and the low Q-between of 0.19 suggests that this variable does not help to explain the heterogeneity among the nine studies. Furthermore, the Q-within indicates that there is 42.54 variability that cannot be explained by whether a study was published or not.

In summary, of the four potential mediating variables that were examined (rater trained vs. not trained; primary author served as rater vs. did not serve as a rater; coder found study supported test vs. coder neutral on whether study supported test; study published vs. study not published), none were found to be helpful in explaining the heterogeneity among the nine studies, especially primary author rater vs. not and study published vs. not. The variables rater trained vs. not trained and coder favor vs. neutral were only moderately useful in explaining the heterogeneity, as is illustrated in Table 7.

Table 7. Potential Mediating Variables (Inter-Rater Reliability)

Variable                        Q-Between    Probability     Q-Within Compared to Variation among      Probability
                                             Q-Between       reliabilities of the nine studies         Q-Within
Rater trained vs. not trained   12.9402      0.000321600*    29.7968 / 42.7370                         0.000103462*
Primary author rater vs. not    1.99688      0.15762         40.7402 / 42.7370                         0.000000908*
Coder favor vs. neutral         14.3060      0.000155369*    28.4310 / 42.7370                         0.000183652*
Study published vs. not         0.19383      0.65975         42.5432 / 42.7370                         0.000000409*

* Significant at the <.001 level
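The relationship between the Q statistics and the probabilities in Table 7 can be illustrated with a short sketch. It assumes, as is conventional for this kind of analysis but not stated explicitly in the text, that each Q-between is referred to a chi-square distribution with one degree of freedom (two groups minus one), and that Q-within equals the total variation minus Q-between. The function and variable names are illustrative only.

```python
import math

# Minimal sketch, assuming each Q-between in Table 7 is compared against a
# chi-square distribution with 1 degree of freedom (two groups per variable).

TOTAL_VARIATION = 42.7370  # variation among reliabilities of the nine studies

# variable -> (Q-between, reported probability, reported Q-within), from Table 7
table7 = {
    "Rater trained vs. not trained": (12.9402, 0.000321600, 29.7968),
    "Primary author rater vs. not": (1.99688, 0.15762, 40.7402),
    "Coder favor vs. neutral": (14.3060, 0.000155369, 28.4310),
    "Study published vs. not": (0.19383, 0.65975, 42.5432),
}


def chi_square_upper_tail_df1(q):
    """P(X > q) for a chi-square variate with 1 degree of freedom.

    With 1 df, X is the square of a standard normal variate, so the
    upper-tail probability equals erfc(sqrt(q / 2)).
    """
    return math.erfc(math.sqrt(q / 2.0))


results = {}
for variable, (q_between, reported_p, reported_q_within) in table7.items():
    p_value = chi_square_upper_tail_df1(q_between)
    q_within = TOTAL_VARIATION - q_between
    results[variable] = (p_value, q_within)
    print(f"{variable}: p = {p_value:.6f} (reported {reported_p}), "
          f"Q-within = {q_within:.4f} (reported {reported_q_within})")
```

Under these assumptions the computed probabilities agree with the tabled values, which suggests that the Q-between column was evaluated with one degree of freedom for each of the four variables.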

Concurrent Validity

Concurrent validity was attempted in 15 (42.86%) studies, was not attempted in 18 studies (51.43%), and two (5.71%) studies were coded as N/A in this category (Table 8).

Meta-analysis of concurrent validity. Of the 15 studies that attempted concurrent validity, only seven (Brudenell 1989, Gulbro 1988, Gulbro 1991, Johnson 2004, Kaiser 1993, Overbeck 2002, Wilson 2004) used correlations. Two of these studies (Gulbro 1988, Kaiser 1993) used more than one test to compare with the art-based tool, which resulted in a total of 11 effect sizes (each treated as a separate study) for the final meta-analysis. The correlations were transformed into Fisher's Zs. Table 9 displays the relevant data for the meta-analysis of concurrent validity. Additional data are available in Appendices R-II. Confidence intervals are included therein, as these provide a method of visually detecting heterogeneity.


Table 8. Concurrent Validity Frequencies

Concurrent Validity Attempted
Citations: Brudenell 1989, Cohen 1988, Easterling 2000, Fowler 2002, Francis 2003, Gulbro 1988, Gulbro 1991, Gussak 2004, Hyler 2002, Johnson 2004, Kaiser 1993, Mills 1993, Overbeck 2002, Wilson 2004, Yahnke 2000
Frequency Total: 15

Concurrent Validity Not Attempted
Citations: Batza 1995, Billingsley 1998, Coffey 1997, Cohen 1995, Couch 1994, Eitel 2004, Gantt 1990, Hacking 1996, Hacking 2000, Hays 1981, Kessler 1994, Kress 1992, Manning 1987, Mchugh 1997, Mills 1989, Neale 1994, Ricca 1992, Shlagman 1996
Frequency Total: 18

N/A
Citations: Bergland 1993, Wadlington 1973
Frequency Total: 2

Table 9. Concurrent Validity Effect Sizes

Obs   ID          Year   Test   r           Manuscript   Patient   Author   Coder   Sample   Fisher's Z
                                            Type         Group     Favor    Favor   Size
1     Brudenell   1989   1      0.751       2            1         0        0       7        0.97524
2     Gulbro      1988   1      0           1            1         2        2       83       0.00000
3     Gulbro      1988   1      0.01        1            1         2        2       83       0.01000
4     Gulbro      1988   1      0.12        1            1         2        2       83       0.12058
5     Gulbro      1988   3      -0.16       1            1         2        2       83       -0.16139
6     Gulbro      1991   1      0.3         1            1         2        2       83       0.30952
7     Johnson     2004   3      -0.095485   1            1         1        1       60       -0.09578
8     Kaiser      1993   2      0.6         2            3         1        1       41       0.69315
9     Kaiser      1993   2      0.11        2            3         1        1       41       0.11045
10    Overbeck    2002   2      0.025       2            3         2        2       32       0.02501
11    Wilson      2004   3      -0.116      2            3         0        2       8        -0.11652

Table 9 Key

Category: Definition
Test: The type of test used for concurrent validity (1=depression measures; 2=attachment measures; 3=other drawing tests, CAT-R, or MCMI-II).
r: The correlations between the art therapy assessment tool and the other test(s) used in the given study.
Manuscript Type: The type of manuscript (1=published or dissertation; 2=unpublished).
Patient Group: The patient group category (1=major mental illnesses; 3=disorders of childhood).
Author Favor: 0=author's conclusion that findings did not support use of art-based test; 1=author's conclusion that findings did support use of test; and 2=author's conclusion that findings neither support nor oppose use of test.
Coder Favor: 0=coder's conclusion that findings did not support use of art-based test; 1=coder's conclusion that findings did support use of test; and 2=coder's conclusion that findings neither support nor oppose use of test.
Sample Size: The treatment or patient group N for each study.
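The Fisher's Z column of Table 9 and a global effect size can be illustrated with the sketch below. It is not the code used for the analysis. It assumes the standard transform z = atanh(r), inverse-variance weights of N - 3, and a back-transformation of the weighted mean Z to the r metric; the weights are an assumption, as the text does not state them. The pairing of r values with sample sizes follows the layout of Table 9, and all function names are illustrative.

```python
import math

# Minimal sketch, assuming Fisher's Z = atanh(r) and weights of (N - 3).
# (study, r, sample size) as read from Table 9.
table9 = [
    ("Brudenell 1989", 0.751, 7),
    ("Gulbro 1988", 0.0, 83),
    ("Gulbro 1988", 0.01, 83),
    ("Gulbro 1988", 0.12, 83),
    ("Gulbro 1988", -0.16, 83),
    ("Gulbro 1991", 0.3, 83),
    ("Johnson 2004", -0.095485, 60),
    ("Kaiser 1993", 0.6, 41),
    ("Kaiser 1993", 0.11, 41),
    ("Overbeck 2002", 0.025, 32),
    ("Wilson 2004", -0.116, 8),
]


def fisher_z(r):
    """Fisher's r-to-Z transform."""
    return math.atanh(r)


def global_effect_size(rows):
    """Weighted mean of Fisher's Z values, back-transformed to r."""
    weights = [n - 3 for _, _, n in rows]
    z_values = [fisher_z(r) for _, r, _ in rows]
    mean_z = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    return math.tanh(mean_z)


def confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for r, computed on the Z scale."""
    z = fisher_z(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * standard_error), math.tanh(z + z_crit * standard_error)


for study, r, n in table9:
    low, high = confidence_interval(r, n)
    print(f"{study}: r = {r}, Z = {fisher_z(r):.5f}, 95% CI = ({low:.3f}, {high:.3f})")

print(f"Global effect size: {global_effect_size(table9):.2f}")
```

Wide intervals for the smallest studies (Brudenell 1989 and Wilson 2004) show why confidence intervals are a useful visual check: a large correlation from a very small sample carries little weight in the synthesis.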


The individual effect sizes were synthesized to obtain a global effect size of 0.09. Heterogeneity (variability) was found among the effect sizes, evidenced by the wide variation in correlations (r values ranging from -0.16 to 0.751). These correlations are nearly all low, and some of them are even less than 0, which may indicate that these studies have low validity (with the exception of Brudenell 1989, although this study had a very small sample size). To determine the origin of the heterogeneity, the aforementioned six categories were treated as potential mediating variables and were statistically analyzed.

Examination of potential mediators. Table 10 provides a summary of the variables that were examined as potential sources of variance. The details of the six analyses are available in Appendices T (p. 101, Author Favor), W (p. 104, Coder Favor), Z (p. 107, Manuscript Type), CC (p. 110, Patient Group), FF (p. 113, Test), and II (p. 116, Year).

Table 10. Potential Mediating Variables (Concurrent Validity)

Variable          Q-Between   Probability     Q-Within Compared to Variation among     Probability
                              Q-Between       reliabilities of the 11 studies          Q-Within
Author Favor      2.55721     0.27843         26.8657 / 29.4230                        0.000745726*
Coder Favor       5.06286     0.079545        24.3601 / 29.4230                        0.001993794
Manuscript Type   6.47957     0.010912        22.9434 / 29.4230                        0.000745726*
Patient Group     4.85927     0.027498        24.5637 / 29.4230                        0.003493698
Test              11.9278     0.002569834     17.4951 / 29.4230                        0.025347
Year              15.2887     0.000478742*    14.1343 / 29.4230                        0.078332

* Significant at the <.001 level

These data reveal that the category Year (year of study completion or publication) is unique in that, although there is 14.1343 variability that cannot be explained by Year, the individual groupings may be able to explain the variability. In group 1, studies conducted prior to 1990, validity is low at 0.00443. The group 2 studies, those conducted between 1990-2000, are most valid at 0.34034. The studies conducted since 2000, group 3, are the least valid at -0.05836.
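The partition of variability by Year can be sketched as follows. This is an illustration under stated assumptions rather than the original analysis code: effect sizes are Fisher's Z values weighted by N - 3, the three year groups use the cut-offs described above (prior to 1990, 1990 to 2000, after 2000), Q-between is referred to a chi-square distribution with two degrees of freedom, and the data rows are those of Table 9 as reconstructed from its layout. All names are illustrative.

```python
import math
from collections import defaultdict

# Minimal sketch of the Q-total / Q-between / Q-within partition by Year,
# assuming Fisher's Z effect sizes and weights of (N - 3).
# (year, r, sample size) for the 11 effect sizes in Table 9.
rows = [
    (1989, 0.751, 7), (1988, 0.0, 83), (1988, 0.01, 83), (1988, 0.12, 83),
    (1988, -0.16, 83), (1991, 0.3, 83), (2004, -0.095485, 60),
    (1993, 0.6, 41), (1993, 0.11, 41), (2002, 0.025, 32), (2004, -0.116, 8),
]


def year_group(year):
    """Group 1: prior to 1990; group 2: 1990 to 2000; group 3: after 2000."""
    if year < 1990:
        return 1
    if year <= 2000:
        return 2
    return 3


def weighted_mean_and_q(items):
    """Return the weighted mean Z and the weighted sum of squared deviations (Q)."""
    total_weight = sum(w for _, w in items)
    mean_z = sum(w * z for z, w in items) / total_weight
    q = sum(w * (z - mean_z) ** 2 for z, w in items)
    return mean_z, q


effects = [(math.atanh(r), n - 3, year_group(year)) for year, r, n in rows]

# Total heterogeneity around the grand weighted mean
_, q_total = weighted_mean_and_q([(z, w) for z, w, _ in effects])

# Heterogeneity remaining inside each year group
grouped = defaultdict(list)
for z, w, group in effects:
    grouped[group].append((z, w))

q_within = 0.0
group_means = {}
for group, items in sorted(grouped.items()):
    mean_z, q_group = weighted_mean_and_q(items)
    group_means[group] = math.tanh(mean_z)  # back-transformed to the r metric
    q_within += q_group

q_between = q_total - q_within
p_between = math.exp(-q_between / 2.0)  # chi-square upper tail with 2 df

print(f"Q-total = {q_total:.4f}, Q-between = {q_between:.4f}, Q-within = {q_within:.4f}")
print(f"Probability of Q-between = {p_between:.9f}")
for group, mean_r in group_means.items():
    print(f"Group {group}: mean validity = {mean_r:.5f}")
```

Under these assumptions the sketch gives values close to those reported for Year in Table 10 and the group means quoted above, with roughly half of the total variability attributable to differences between the three year groups.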


Conclusion

In this chapter, descriptive results were provided. This included data for the following categories: citation dates; citation types; art therapy assessment types; patient group categories; and rater tallies. Detailed information about inter-rater reliability was described, including meta-analysis results and an examination of potential mediating variables. Concurrent validity data were also examined, consisting of meta-analysis results and an examination of potential mediators. Conclusions based upon these results are described in the subsequent chapter.


CHAPTER 5 CONCLUSIONS

It is the consensus of most mental health professionals, agency administrators, and insurance companies that regardless of the formality or structure, assessment, and reassessment at appropriate times, constitutes the core of good practice (Gantt, 2004, p. 18).

It has been demonstrated in the previous chapters that art therapists use and develop assessments and rating instruments, and that there is a need to improve the validity and reliability of these tools. In order to address this need, a systematic analysis of 35 pertinent studies was conducted. A hypothesis and three research questions were formulated to specifically deal with the issues at the root of the problem. These are discussed presently, in relation to the study's results.

To address research question one, methodological problems identified in the primary studies are described. Problems with validity and reliability of the primary articles are delineated, followed by a discussion of methods for the improvement of assessment tools. Subsequently, questions two and three are addressed: to what extent do art therapy assessments measure the process of change that a client may experience in therapy?; and, do objective assessment methods such as standardized art therapy tools give us enough information about clients?

Limitations of the study are presented in the following sections: validity issues in study retrieval; issues in problem formulation; judging research quality; issues in coding; sources of unreliability in coding; validity issues in data analysis; and issues in study results. Finally, recommendations for further study are described.

Research Questions

Question One

To what extent are art therapy assessments and rating instruments valid and reliable for use with clients? It was assumed that methodological difficulties in previous art therapy


assessment research exist, and that this would impact the validity and reliability of assessment tools.

Methodological problems identified in the primary studies. Twenty-eight of the 35 studies reported methodological weaknesses (Appendix JJ). However, seven studies did not mention weaknesses, and some failed to include an exhaustive list of problems. The weaknesses that were reported are tallied in Figure 4. Twenty-four studies acknowledged a total of 39 subject-related flaws. Data collection problems were also frequent, with 16 studies reporting a total of 18 occasions. Other reported flaws are: (1) rating instrument-related (11 studies); (2) concerned with inter-rater reliability (eight studies); (3) categorized as other (five studies); and (4) assessment-related (two studies). Three papers that failed to report study procedures and/or findings (Cohen, Hammer & Singer, 1988; Hays, 1981; McHugh, 1997) were identified. A more specific itemization of author-identified study flaws is included in Appendix KK. Coder-identified weaknesses are included in Appendix LL.

Problems with validity and reliability of the articles researched. As McNiff (1998) concluded, many tools are generally deficient of data supporting their validity and reliability, and are not supported by credible psychological theory. Specifically, McNiff found that those who choose to assess clients through art have neglected to convincingly address the essence of empirical scientific inquiry: (1) findings that link character traits with artistic expressions; (2) replicable results based upon copious and random data; and (3) uniform outcome measures which justify diagnosis of a client via his or her artwork. Furthermore, Gantt and Tabone (1998) reported that the previous research on projective drawings and assessments has yielded mixed results. These mixed results are reflected in a summary of the meta-analyses identified in Chapter Two of the present study.
Garb (2000) found that projective techniques should not be used to detect child sexual abuse, whereas West (1998) concluded that projective instruments could effectively discriminate distressed children from those who were non-distressed. Hunsley and Bailey (1999) found that there is no scientific basis for using the Rorschach, and provided ample support for this conclusion. Similarly, Meyer and Archer (2001) demonstrated that the Rorschach had greater validity for some purposes than for others, and identified eleven salient empirical and theoretical gaps in the Rorschach knowledge base. Conversely, Parker, Hanson, and Hunsley (1988)


determined that the Rorschach could be considered to have sufficient psychometric properties if used according to the purpose for which it was designed and validated.

Figure 4. Author-Identified Study Weaknesses

Weakness category (total number of weaknesses): No weaknesses identified (7); Procedures/findings not reported (0); Subject-related weaknesses (39); Data collection weaknesses (18); Assessment-related weaknesses (2); Rating instrument weaknesses (11); Inter-rater reliability weaknesses (9); Other (5)

McNiff's (1998) findings are supported to some extent by the descriptive results of the present study. That 18 out of 35 papers included in this study did not attempt concurrent validity, and that 16 out of 35 did not attempt inter-rater reliability, reflects McNiff's conclusion that many tools are generally deficient of data supporting their validity and reliability.

Concurrent validity tells us about the degree to which the scores on an instrument are related to the scores on another instrument administered at the same time, or to some other criterion available at the same time (Fraenkel & Wallen, 2003). In the current research, of the 15 (42.86%) primary studies that attempted concurrent validity, seven used correlations, yielding 11 effect sizes that were eligible for the application of meta-analysis techniques. The category Year (year of study completion or publication) was found to be unique in that, although 14.13 variability could not be explained by Year, the individual groupings might have been able to explain the variability. In group 1, studies conducted prior to 1990, there was zero validity. The group 2 studies, those conducted between 1990 and 2000, were most valid at 0.34. The studies conducted since 2000, group 3, were the least valid at -0.06. However, these results should be interpreted with caution. Overall, each of the variables that were examined as potential sources of variance was not found to contribute significantly to the variance. Thus, results are inconclusive.

The previous research on psychological tools has yielded mixed results (Gantt and Tabone, 1998). For the 35 studies included in the present analysis, results also appear to be mixed, although the majority (about 94 percent) of studies were neutral or positive. As is illustrated in Tables 11 and 12, in 19 of 35 studies, the authors concluded that the study's findings did support the use of the art therapy assessment, whereas in only two of 35 studies did the authors indicate that their research did not support the use of an assessment. In 14 studies, the authors' concluding remarks revealed that their studies neither supported nor opposed the use of an assessment (i.e., results were neutral). Coder 1 found that seven studies supported the use of the art therapy assessment; that 26 studies neither supported nor opposed; and that two did not support use of the assessment. Thus, for both author and coder judgments, the results were somewhat mixed, with the majority of results found to be either neutral or in favor of the use of the given assessment.

Six (or 17.14%) studies used one rater, thus inter-rater reliability was not applicable. For the seven studies that qualified for inclusion in the meta-analysis of inter-rater reliability, the global effect size was high, at 0.9. However, of the four potential mediating variables that were examined (rater trained vs. not trained; primary author served as rater vs. did not serve as a rater; coder found study supported test vs. coder neutral on whether study supported test; study published vs. study not published), none explained the heterogeneity among the nine studies, especially primary author rater vs. not and study published vs. not. The variables rater trained vs. not trained and coder favor vs. neutral were only moderately useful in explaining the heterogeneity. Therefore, it is not appropriate to draw conclusions about inter-rater reliability based on the current study's findings.


Table 11. Author/Coder Favor Tally

Citation ID        Author Favor   Coder Favor
Batza 1995         2              2
Bergland 1993      1              1
Billingsley 1998   1              2
Brudenell 1989     0              0
Coffey 1997        1              2
Cohen 1988         2              2
Cohen 1995         1              2
Couch 1994         1              2
Easterling 2000    1              2
Eitel 2004         2              0
Fowler 2002        2              2
Francis 2003       1              1
Gantt 1990         1              2
Gulbro 1988        2              2
Gulbro 1991        2              2
Gussak 2004        1              1
Hacking 1996       1              2
Hacking 2000       1              1
Hays 1981          1              2
Hyler 2002         2              2
Johnson 2004       1              1
Kaiser 1993        1              1
Kessler 1994       1              1
Kress 1992         2              2
Manning 1987       1              2
Mchugh 1997        1              2
Mills 1989         2              2
Mills 1993         1              2
Neale 1994         2              2
Overbeck 2002      2              2
Ricca 1992         2              2
Shlagman 1996      2              2
Wadlington 1973    1              2
Wilson 2004        0              2
Yahnke 2000        2              2

0 = conclusion that the study's findings did not support use of the art-based test
1 = conclusion that the study's findings did support use of the test
2 = conclusion that the study's findings neither support nor oppose use of the test


Table 12. Author/Coder Favor Frequencies

Author Favor                                                                                       Frequency Total
Author seemed to conclude that the study's findings DID support use of the test                    19
Author seemed to conclude that the study's findings neither support nor oppose use of the test     14
Author seemed to conclude that the study's findings DID NOT support use of the art-based test      2

Coder Favor                                                                                        Frequency Total
Coder concluded that the study's findings DID support use of the test                              7
Coder concluded that the study's findings neither support nor oppose use of the test               26
Coder concluded that the study's findings DID NOT support use of the art-based test                2

In summary, question one, "To what extent are art therapy assessments and rating instruments valid and reliable for use with clients?", cannot be addressed by the present results. Therefore, although methodological difficulties in previous art therapy assessment research appear to exist, details about how these difficulties actually impact the validity and reliability of assessment tools remain unknown.

Methods for improvement of tools. This study's findings have implications for the improvement of existing tools as well as those that have yet to be developed. In Chapter Two, avenues for the improvement of art therapy assessments and rating instruments were discussed. Findings of the present study support the recommendations for enhancement of existing tools as well as future research in this realm put forth by previous authors, and also reveal additional areas in which researchers could make improvements.

As is listed in Table 13, the results of the present study support the recommendations for improvements identified by previous authors: (1) data should be collected from a large number of participants (Hagood, 2002; Neale & Rosal, 1993); (2) subjects should be matched (Hagood, 2002); (3) researchers should ensure that they have consulted previous assessment literature prior to developing their own tools and/or furthering the existing research (Hacking, 1999); (4) researchers should publicly admit to flaws in their work and make trainees aware of these flaws while striving for improvement of the assessment tool and rating system; (5) rating systems should incorporate objective criteria on which to score variables (Neale & Rosal, 1993); (6) inter-rater reliability should be established; and (7) data collection and appropriate analysis procedures should be duplicated to establish reliability and effectiveness of previously studied art-based tests.

The present study was limited by the sampling methods used in the primary research papers: The sampling methods of 22 (62.86%) studies were flawed, thereby preventing generalizability of results.

Table 13. Sampling Flaws

Sampling flaw: Small N (less than 30)
Frequency total: 13
Citations: Batza 1995, Billingsley 1998, Brudenell 1989, Cohen 1995, Couch 1994, Easterling 2000, Fowler 2002, Gantt 1990, Hacking 1996, Manning 1987, Mills 1993, Ricca 1992, Wilson 2004

Sampling flaw: Subjects not matched
Frequency total: 10
Citations: Batza 1995, Brudenell 1989, Coffey 1997, Cohen 1995, Couch 1994, Hacking 1996, Kress 1992, Mchugh 1997, Mills 1989, Neale 1994

Sampling flaw: Subjects not randomly selected
Frequency total: 6
Citations: Billingsley 1998, Francis 2003, Gussak 2004, Overbeck 2002, Ricca 1992, Shlagman 1996

Sampling flaw: Sampling methods unable to be determined
Frequency total: 24
Citations: Batza 1995, Bergland 1993, Brudenell 1989, Coffey 1997, Cohen 1988, Cohen 1995, Couch 1994, Eitel 2004, Fowler 2002, Gulbro 1988, Gulbro 1991, Hacking 2000, Hays 1981, Hyler 2002, Johnson 2004, Kaiser 1993, Kessler 1994, Kress 1992, Mchugh 1997, Mills 1989, Mills 1993, Wadlington 1973, Wilson 2004, Yahnke 2000
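The frequencies in Table 13 can be tallied directly from the citation lists. The sketch below is illustrative only: the citation strings are transcribed from the table, the total of 35 studies is taken from the Conclusion, and the variable names are hypothetical. The text does not state how the figure of 22 flawed studies was derived; the assumption made here is that it counts the distinct studies showing at least one of the first three flaws, and that reading happens to reproduce the reported figures.

```python
# Citations per sampling flaw, transcribed from Table 13.
small_n = {
    "Batza 1995", "Billingsley 1998", "Brudenell 1989", "Cohen 1995",
    "Couch 1994", "Easterling 2000", "Fowler 2002", "Gantt 1990",
    "Hacking 1996", "Manning 1987", "Mills 1993", "Ricca 1992", "Wilson 2004",
}
not_matched = {
    "Batza 1995", "Brudenell 1989", "Coffey 1997", "Cohen 1995", "Couch 1994",
    "Hacking 1996", "Kress 1992", "Mchugh 1997", "Mills 1989", "Neale 1994",
}
not_random = {
    "Billingsley 1998", "Francis 2003", "Gussak 2004", "Overbeck 2002",
    "Ricca 1992", "Shlagman 1996",
}

TOTAL_STUDIES = 35  # number of studies in the systematic analysis (see Conclusion)

# Assumption: a study is "flawed" if it appears under any of the three flaws above.
flagged = small_n | not_matched | not_random
share = round(100 * len(flagged) / TOTAL_STUDIES, 2)

print(len(small_n), len(not_matched), len(not_random))  # 13 10 6
print(len(flagged), share)                              # 22 62.86
```

Because a study can exhibit more than one flaw, the row frequencies sum to more than the number of distinct studies; the set union avoids double counting.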

A breakdown of author-identified study weaknesses in relation to rating systems illustrates the ways in which existing systems could be improved (Table 14), and thereby holds implications for systems that have yet to be developed. Because the majority of primary studies were about the DDS, a great deal of information about the DDS rating system is available. Table 14 provides a list of suggestions for revision of the rating guide and scales of the DDS Drawing Analysis Form (DAF). Specifically, Mills (1989) found that the rating guide shows a lack of consistent theoretical outlook and suggested that it be improved by rewriting for clarification of terms and rating procedures, and for inclusion of illustrations to augment verbal definitions (p. 133). Johnson (2004) suggested the addition of two items to the DDS checklist: multidirectional movement and unrelated multiple images. Problems with the dichotomous and categorical rating scales in the DDS Rating Guide and Drawing Analysis Form were cited (Billingsley 1998, Gulbro 1988, Gulbro 1991, Neale 1994), and some authors reported weaknesses with the DDS Content Checklist and Tree Scale (Couch 1994, Kress 1992). Ricca (1992) found the DAF to be subjective and recommended that more objective criteria be incorporated. Finally, Fowler (2002) criticized Mills and Cohen's original inter-rater reliability results.

In 1990, Gantt stated that two FEATS scales (perseveration and rotation) needed additional refinement in order to be useful, and that the FEATS scales used imprecise measurements. Francis (2003) found that the Bird's Nest Drawing directives and the graphic indicators on the checklist could be worded more precisely.

If researchers were to consider the suggestions put forth in Table 14, improvements to rating systems could be implemented. As discussed previously, researchers should also bear in mind that interval/ratio scales can be compared in terms of direction or magnitude, but that the scores will be more variable. Conversely, nominal and ordinal measures should not be compared in terms of direction or magnitude, but they are more likely to produce consistent responses. Because there is no conclusive evidence for using Likert-type (interval) versus binary-choice (nominal) items in rating instruments, the choice should be specific to the instrument's purpose (B. Biskin, personal communication, February 22, 2005). The format that best represents the underlying construct to be measured should guide the selection of format. Both methods have value as long as their limitations are realized.
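To make the point about binary-choice (nominal) items concrete, the following minimal sketch computes Cohen's kappa, a commonly used chance-corrected agreement statistic for two raters scoring a nominal item. It is offered as an illustration only: the ratings are invented, the function name is hypothetical, and nothing here implies that the primary studies reported kappa rather than another statistic.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on a nominal item."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of drawings")
    n = len(ratings_a)
    # Observed proportion of drawings on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)

# Hypothetical data: 1 = feature present, 0 = feature absent, for ten drawings.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))  # 0.6
```

In this invented example the raters agree on 8 of 10 drawings, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw agreement rate. Kappa treats categories as unordered, which suits binary-choice items; Likert-type ratings would usually call for a statistic that respects ordering.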
Well-constructed, standardized scales for rating artwork are vital in order to validate assessment findings and to determine the reliability of subjects' scores. In order to address question one, methodological problems identified in the primary studies, and problems with their validity and reliability, were discussed. Although question one could not be addressed by the present results, it was determined that this study's findings have implications for the improvement of existing tools as well as those that have yet to be developed. Information about flaws in the FEATS, the BND, and in particular the DDS, was provided. It was suggested that if weaknesses are better understood by researchers, then this information should help to improve the existing rating systems, and ensure the enhancement of newly developed systems.

Table 14. Rating Systems Needing Revision

Studies that identified the DDS Rating System as needing revision:

Citation ID: Billingsley 1998
Author's comment: Stated that the DDS Rating Guide may need revision as categorical scales were found to be problematic.

Citation ID: Couch 1994
Author's comment: Reported weakness of DDS Content Checklist and Tree Scale.

Citation ID: Fowler 2002
Author's comment: Need for standardized rating criteria on DAF. Criticism of Mills and Cohen's inter-rater reliability.

Citation ID: Gulbro 1988
Author's comment: Weaknesses of the DDS itself (lack of ordinal scales on DAF).

Citation ID: Gulbro 1991
Author's comment: Dichotomous and categoric scoring variables seriously limit the sensitivity of the instrument.

Citation ID: Johnson 2004
Author's comment: Author suggested adding two items to DDS checklist: multidirectional movement and unrelated multiple images.

Citation ID: Kress 1992
Author's comment: Content checklist and Creekmore Tree Scale have not been tested for reliability and validity.

Pages (Billingsley 1998 through Kress 1992, as listed): p. 124, p. 113, p. 224, p. 123, p. 355, p. 23

Citation ID: Mills 1989
Author's comment: The rating guide, created by a number of art therapists from diverse backgrounds, shows a lack of consistent theoretical outlook, a weakness attributable to the eclecticism of its creators. It could be improved by rewriting for clarification of terms and rating procedures, and for inclusion of illustrations to augment verbal definitions. However, ratings resulting from the improved system would then differ from the ratings resulting from using the 1986-B format, and results would not be comparable.
Page: p. 133

Citation ID: Neale 1994
Author's comment: The Drawing Analysis Form and the Rating Guide may need revision. Because of the categorical nature of the rating format, statistical analysis of the data is extremely complicated. For a less complicated analysis, the rating format should be comprised of variables rated on a continuous data scale. By using a continuous data scale, the distribution of errors would be normal and use of a statistical analysis with more power would be possible.
Page: p. 126

Citation ID: Ricca 1992
Author's comment: Statistical analysis limited to 14 of 23 DAF categories; the DAF seemed subjective, therefore more objective criteria is highly suggested.
Page: p. 126, p. 128

Studies that identified the FEATS as needing revision:

Citation ID: Gantt 1990
Author's comment: Two FEATS scales (perseveration and rotation) still need additional refinement to be useful. The scales use imprecise measurements.
Page: p. 218

Studies that identified the BND Rating System as needing revision:

Citation ID: Francis 2003
Author's comment: Wording of directives; wording of graphic indicators on checklist should be more precise.
Page: p. 135


Question Two

It was assumed that art therapy assessments measure the process of change that a client may experience in therapy. However, the variability of the concurrent validity and inter-rater reliability meta-analysis results of the present study indicates that the field of art therapy has not yet produced sufficient information in the area of assessments and rating instruments. The variation across the primary studies was vast and may have been produced by one of two factors that are known to cause differences in tests of main effects: (1) sampling error (chance fluctuations due to the imprecision of sampled estimates), and (2) differences in study participants or how studies are conducted (Cooper, 1998). Taveggia (1974) underscored the implications of using sampling techniques and probability theory to make inferences about populations:

"A methodological principle overlooked by writers of reviews is that research results are probabilistic. This suggests that the findings of any single research are meaningless; they have occurred simply by chance. It also follows that if a large enough number of researches has been done on a particular topic, chance alone dictates that studies will exist that report inconsistent and contradictory findings!" (pp. 397-398)

Thus, chance fluctuation due to the inexactness of sampled estimates is one possible source of variance in the present study's results. To address the issue of potential variability attributable to methodology, Cooper's (1998) recommendation to examine substantive differences between studies was followed. However, the results of the sensitivity analyses revealed only trivial sources of variance; none of the mediating variables were found to significantly contribute to the variance in the results. Thus, due to sampling error and vast differences in study participants and methodology, the extent to which art therapy assessments measure the process of change that a client may experience in therapy cannot be determined.
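The distinction drawn above, between variation due to sampling error and variation due to real differences among studies, is what a homogeneity test examines. The sketch below shows one common form of that test: correlations are Fisher z-transformed, weighted by n - 3, and summarized with the Q statistic, which is compared against a chi-square distribution with k - 1 degrees of freedom. It is a hedged illustration, not the procedure used in the present study, which this section does not spell out. The correlations and sample sizes are invented, and the function name is hypothetical.

```python
import math

def homogeneity_q(correlations, sample_sizes):
    """Return (Q, df, I2) for k correlations under a fixed-effect model."""
    z = [math.atanh(r) for r in correlations]   # Fisher z transform
    w = [n - 3 for n in sample_sizes]           # inverse-variance weights for z
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1
    # I2: share of total variation beyond what sampling error alone would produce.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2

# Hypothetical inter-rater correlations from five studies (not data from this analysis).
example_r = [0.20, 0.45, 0.70, 0.10, 0.60]
example_n = [25, 40, 30, 50, 35]

q, df, i2 = homogeneity_q(example_r, example_n)
CHI2_CRIT_DF4 = 9.488  # chi-square critical value, alpha = .05, df = 4
print(f"Q = {q:.2f}, df = {df}, I2 = {i2:.2f}")
print("reject homogeneity" if q > CHI2_CRIT_DF4 else "retain homogeneity")
```

When Q exceeds the critical value, the spread among study results is larger than sampling error alone would be expected to produce, which is the situation in which a search for moderating variables (such as the sensitivity analyses described above) becomes relevant.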
Question Three

"Do objective assessment methods such as standardized art therapy tools give us enough information about clients?" Based on the review of the literature, it was assumed that the most effective approach to assessment incorporates objective measures such as standardized assessment procedures (formalized assessment tools and rating manuals; portfolio evaluation; behavioral checklists), as well as subjective approaches such as the client's interpretation of his or her artwork. Due to the inconclusive results of the present study, it is recommended that researchers continue to explore these objective and subjective approaches to assessment. Based on the present study's outcomes, question three cannot be addressed, because the field of art therapy has not yet produced sufficient information in the area of assessments and rating instruments. Not only do objective assessment methods such as standardized art therapy tools fail to give us enough information about clients, but the previous research also fails to provide adequate information about the tools themselves.

The present study sought to answer the question, "What does the literature tell us about the current state of art therapy assessments?" The null hypothesis, that homogeneity exists among the study variables identified in art therapy assessment and rating instrument literature, was rejected, thereby demonstrating that art therapists are still in a nascent stage of understanding assessments and rating instruments.

Limitations of the Study

Seven types of limitations were identified in the present study. These include: validity issues in study retrieval; issues in problem formulation; judging research quality; issues in coding; sources of unreliability in coding; validity issues in data analysis; and issues in study results.

Validity Issues in Study Retrieval

The most serious form of bias enters a meta-analysis at the literature search stage, because of the difficulty inherent in assessing the influence of a potential bias (Glass et al., 1981). Furthermore, every study (hence every individual represented therein) used in a synthesis has an unequal chance of being selected (Cooper, 1998). Although every effort was made to gather all published and unpublished research for the present study, this synthesist encountered problems that were beyond her control.
These included: (a) primary researchers who reported data carelessly or incompletely; and (b) individuals and/or libraries that faltered in their services, failing to provide all potentially relevant documents. It was not possible to locate many theses and dissertations. Access to more papers would have increased the number of studies included in the concurrent validity and inter-rater reliability meta-analyses, which may have yielded more substantial results.


Issues in Problem Formulation

This researcher endeavored to uncover variables to demonstrate why results varied in different studies and to generate notions that would explain these higher-order relations (Cooper, 1998). Despite these efforts, however, it is possible that some variables were not uncovered.

Judging Research Quality

Cooper (1998) cited two sources of variance in the quality judgment of research evaluators: (a) the relative importance they assign to different research design characteristics, and (b) their judgments about how well a particular study meets a design criterion. Synthesists' predispositions usually have an impact on methods for evaluating studies, and the a priori quality judgments required to exclude studies are likely to vary from judge to judge (p. 83). Although synthesists are usually aware of these biases as well as the outcomes of studies as they collect the research, it is likely that the findings will be tainted by the evaluator's predispositions.

Issues in Coding

Low-inference codings are those that require the researcher to locate the needed information in the primary study and transfer it to the coding sheet (Cooper, 1998, p. 30). High-inference codings, on the other hand, require coders to make inferential judgments about the studies, which may create problems in reliability. In the present study, high-inference judgments were a factor in assigning scores to the categories Author Favor and Coder Favor, as Coder 1 had to decide to what extent she and the primary authors favored the use of the assessment based upon primary study findings. It is possible that reliability may have been compromised by these subjective judgments.

Other Sources of Unreliability in Coding

Problems of reliability arise in a meta-analysis at the coding stage, when different coders fail to see or judge study characteristics in the same way (Glass et al., 1981). Four sources of unreliability in study coding may have been a factor in the present study: (a) recording errors; (b) lack of clarity in primary researchers' descriptions; (c) ambiguous definitions provided by the research synthesist, which lead to disagreement about the proper code for a study characteristic; and (d) coders' predispositions, which can lead them to favor one interpretation of an ambiguous code over another (Cooper, 1998).

Validity Issues in Data Analysis


During the data analysis stage, inappropriate use of rules of inference creates a threat to validity (Cooper, 1998). In order to avoid this, the researcher of the present study worked closely with a consulting statistician. In addition, every effort was made to avoid misinterpretation of synthesis-generated results in such a way as to substantiate statements of causality, because any conclusions based on synthesis-generated evidence are always purely associational (p. 155).

Issues in Study Results

Although it is ideal that the incorporated studies permit generalizations to the particular individuals or groups of focus (Cooper, 1998), the present study was limited by the sampling methods used in the primary research papers. Because the sampling methods of 22 studies (62.86%) were flawed, generalizability of results is not possible. Furthermore, although a larger sample size usually increases stability, if more studies could have been retrieved, it is likely that even more variability would have been found. It is apparent that relationship strength can be examined in a meta-analysis, but causality is a separate matter (Cooper, 1998). It is very risky to make statements of causality based on synthesis-generated results, because it is impossible to determine the true causal agent. When a relation between correlated characteristics is found to exist, however, this information can be used to recommend future directions for primary researchers.

Recommendations for Further Study

There are several areas on the topic of assessment in which further study could be pursued: (1) replication of Hacking's study on individual drawing scales; (2) exploration of amalgamated findings of different assessments; (3) completion of a systematic analysis of patient groups; and (4) periodic execution of thorough literature reviews and systematic analyses. These are discussed below.

By including non-art therapy assessment tools in her meta-analysis, Hacking (1999) was able to meta-analyze drawing elements (such as color, line, and space). Hacking listed the items in a large table with their test values and p-values, and standardized the scores to aggregate them. This was followed by the application of a meta-analytic procedure to each category to discover whether the non-significant results outweighed the significant. For example, Hacking aggregated all variables covering 14 drawing areas categorized as form variables, objective content, or subjective content. Subjective variables seemed to produce the largest effect, but there were demonstrable if small effects for the two other categories. Specific results are as follows:


Form variables (77 form variables from 14 tabulated drawing areas); effect size: 0.1977; confidence limits (all significances p<0.05): 0.1217-0.2736.

Objective content (38 objective content variables from 14 tabulated drawing areas); effect size: 0.3062; confidence limits (all significances p<0.05): 0.2105-0.4020.

Subjective content (102 subjective content variables from 14 tabulated drawing areas); effect size: 0.4283; confidence limits (all significances p<0.05): 0.3704-0.4863.

It would be useful for an art therapist to replicate Hacking's study with art therapy assessment tools, if a larger number of studies that include data on individual drawing scales become available in the future.

Because the majority of papers located for inclusion in the present study were about the DDS, the results offer more information about the DDS than any other assessment. In the future it would be advantageous to explore amalgamated findings of other assessments to the same extent.

If more studies could be located, it might be worthwhile to conduct a systematic analysis of patient groups. Such an analysis could enable art therapists to gain a better understanding of the implications of the use of different assessments with various patient groups. This type of investigation might also increase the validity of assessment instruments.

The field of art therapy would benefit if thorough literature reviews and systematic analyses on this topic were to be conducted periodically. This would help researchers to maintain a sound understanding of the work that has already been accomplished, and what needs to be done to improve the research.

Conclusion

In order to address the hypothesis and research questions established by this researcher, the results of a systematic analysis of 35 studies were presented and discussed. To address the first research question, methodological problems identified in the primary studies were described.
Problems with validity and reliability of the primary articles were discussed, followed by a delineation of methods for the improvement of assessment tools. Question two could not be addressed due to sampling error and vast differences in study participants and methodology. Because the field of art therapy has not yet produced sufficient information in the area of assessments and rating instruments, question three could not be addressed. The study's limitations were presented; these included validity issues in study retrieval, issues in problem formulation, judging research quality, issues in coding, sources of unreliability in coding, validity issues in data analysis, and issues in study results. Recommendations for further study were provided.

Based on the present analysis, the art therapy assessment and rating instrument literature reveals that flaws in the research are numerous, and that much work has yet to be done. As was revealed via an extensive review of the literature and elucidation of the debated issues, art therapists are still in a nascent stage of understanding assessments and rating instruments.

Decisions about treatment and diagnosis are often based upon the results of various assessment and evaluation techniques. When any form of artwork is used to gain insight about clients, art therapists need to be aware of the benefits and limitations of their approach and the tools they use. It is the responsibility of every art therapist to be well versed in the areas of evaluation, measurement, and research methodology. Art therapists should identify their own personal philosophy, whether in support of or opposed to the use of assessments. Perhaps the wisest stance on assessment involves embracing both sides of this issue and moving forward with the work that needs to be done.


APPENDIX A
ART THERAPY ASSESSMENT INSTRUMENTS

This is not an exhaustive list. Reference citations included when available.

Art Therapy Assessment Instruments Included in the Systematic Analysis, with Corresponding Standardized Rating System

Bird's Nest Drawing (BND) (Kaiser, 1993). Rating system: Attachment Rating Scale (Kaiser, 1993)
Bridge Drawing (Hays & Lyons, 1981). Rating system: 12 variables assembled by authors (informal)
Diagnostic Drawing Series (DDS) (Cohen, Hammer & Singer, 1988). Rating system: Drawing Analysis Form; Content Checklist (Cohen, 1985/1994)
A Favorite Kind of Day (Manning Rauch, 1987). Rating system: Aggression Depicted in the AFKD Rating Instrument (a three-item checklist) (Manning Rauch, 1987)
Person Picking an Apple From a Tree (PPAT). Rating system: Formal Elements Art Therapy Scale (FEATS) (Gantt & Tabone, 1998)

Art Therapy Assessment Instruments, with Corresponding Rating System

Arrington Visual Preference Test (AVPT). Rating system: (Dorris Arrington)
Art Therapy-Projective Imagery Assessment (ATPIA) (EVMS). Rating system: (EVMS)
Dot-to-Dot
Draw-A-Village Task (DAV)
Draw-Yourself Task (DYT)
Face Stimulus Assessment (FSA) (Betts, 2003). Rating system: FSA Rating Guidelines (Betts, 2003)
Favorite Place Drawings
Levick Emotional and Cognitive Art Therapy Assessment (LECATA) (Levick). Rating system: (Myra Levick, 2001)
Mandala Assessment Research Instrument (MARI) Card Test (Kellogg). Rating system: Manual containing possible interpretations
Ulman Personality Assessment Procedure (UPAP) (Ulman). Rating system: A checklist

Stand-Alone Rating Systems

Descriptive Assessment for Psychiatric Art (DAPA)
Sheppard Pratt Art Rating Scale (SPAR) (Shoemaker-Beal, 1977; Bergland & Moore Gonzalez, 1993)
Questionnaire NBS (Elbing & Hacking, 2001) for objective picture criteria; Semantic Differential (Simmat, 1969) for contextual-affective picture criteria
Unnamed 18-item rating scale (with 4 additional color-related test items) (Wadlington & McWhinnie, 1973)


APPENDIX B
TWO POPULAR ART THERAPY ASSESSMENTS AND THEIR CLINICAL APPLICATIONS: THE DDS AND THE PPAT

The Diagnostic Drawing Series (DDS) (Cohen, Hammer, & Singer, 1988)

The DDS is a three-picture art interview that was developed in 1982 by art therapists Barry Cohen and Barbara Lesowitz in Virginia. It first entered the published literature in 1985. Drawings are collected and maintained for research purposes in the DDS archive, Alexandria, VA. The DDS should be administered on a tabletop during one 50-minute session (most people finish the Series in about 20 minutes, leaving time for discussion). It can be administered individually and in groups. Materials are: 18 x 24 inch white 60 lb or 70 lb drawing paper that has a slight tooth or texture; a standard 12-pack of Alphacolor square chalk pastels (unwrapped) in North America; Faber Castell elsewhere.

DDS Directives (Cohen & Mills, 2000):

1) Free Picture: "Make a picture with these materials." This is the unstructured task of the Series. It typically reveals a manifestation of the client's defense system (i.e., the amount and type of information the patient is initially willing to share [Feder & Feder, 1998]).

2) Tree Picture: "Draw a picture of a tree." (NB: even if a tree was drawn in the first picture.) This is the structured task of the Series. Deemed to be a non-threatening subject (Feder & Feder, 1998).

3) Feeling Picture: "Make a picture of how you're feeling using lines, shapes, and colors." This is a semi-structured task; it asks the client to communicate about his/her affective state directly, as well as to represent it in abstract form.

"Patients are rarely fooled by the artifice of projective tests, which are now part of our popular culture. If we want to know about the patient's experience or self concept, why not simply ask?" (Cohen, Mills, & Kijak, 1994)

The Person Picking an Apple From a Tree (PPAT) (Gantt, 1990)

This assessment consists of one drawing that can be administered in an individual or group art therapy session (Arrington, 1992). The client is asked to "Draw a picture of a person picking an apple from a tree." The materials include one piece of 12 by 18 inch white drawing paper and a set of 12 Mr. Sketch scented felt-tip markers. The drawings are rated with the Formal Elements Art Therapy Scale (FEATS) manual (Gantt & Tabone, 1998).

The PPAT drawing was first described by Viktor Lowenfeld (1939, 1947) in a study he conducted on children's use of space in art. His instructions were detailed: "You are under an apple tree. On one of its lower branches you see an apple that you particularly admire and that you would like to have. You stretch out your hand to pick the apple, but your reach is a little short. Then you make a great effort and get the apple after all. Now you have it and enjoy eating it. Draw yourself as you are taking the apple off the tree" (1947, pp. 75-76).

Little else has been written about this drawing. Greg Furth included examples in his book The Secret World of Drawings (1988, pp. 86-88) but did not discuss his reason for using it. Gantt and Tabone (1998) credit art therapist Tally Tripp for bringing the idea of the PPAT to Washington, DC from Georgia. Tally had worked with a Georgia art therapist who used the PPAT frequently.


APPENDIX C
HOW AN ART THERAPY ASSESSMENT IS DEVELOPED

As artists, art therapists believe that they possess a unique sensibility for studying drawings (Hagood, 2002). It follows, therefore, that they attempt to create their own assessment instruments. However, developing a standardized art therapy assessment is a long and arduous undertaking (Betts, 2003). A lifetime of effort can be devoted to this endeavor (Kaplan, 2001, p. 144).

It is common to find an art therapist who has used any number of previously existing assessments, such as the DDS or the PPAT. However, there are occasions when an art therapist has used such tools and has found them to be inappropriate for use with his or her specific group of clients. Perhaps through trying out different ways of working with the clients, and experimenting with different media and methods, the art therapist may begin to formulate ideas about creating his or her own unique tool that is tailored for use with a particular client population.

For example, art therapist Donna Betts (2003) developed the Face Stimulus Assessment (FSA) while she was employed at a multicultural school in a major metropolitan area. She had not had much success in using the established art therapy instruments, so she invented a technique for evaluating her non-verbal clients who had cognitive impairments. Betts's clients were not motivated to draw without a visual stimulus and they were unable to follow directions. Even a basic directive such as "draw a person" did not elicit any response from the clients who had severe mental retardation or developmental delay. Thus, over time and with several pilot trials, Betts (2003) developed a method that would uncover her clients' strengths through art.

Betts's experience in creating the FSA is not so unlike the processes that Ulman (1965, 1975, 1992; Ulman & Levy, 1975, 1992), Kwiatkowska (1975, 1978), and Silver (1978, 1983) undertook in developing their assessments. Ulman, Kwiatkowska, and Silver also designed tools based on their clinical experience with clients who had unique challenges, as was discussed in Chapter Two of this prospectus.

Once a new assessment has been piloted, it must undergo rigorous validity and reliability studies in order to be established as a tool and used in clinical settings. Lehmann and Risquez (1953) delineated specific requirements that are necessary for ensuring the credibility of an art-based assessment:

1) It should be possible to obtain repeated productions that are comparable in order to obtain a longitudinal view of the variations in the patient's graphic expression over a period of time.

2) The method should allow for the comparison of productions of the same patient and of different patients at different times by means of a standardized method of rating.

3) It should be possible to obtain valid and useful information on the patient's medical condition through the evaluation of his or her paintings without having to spend additional time in observing the patient while he or she is painting or in conducting an interview about the finished product.


APPENDIX D


APPENDIX E


APPENDIX F Hacking, 1999


APPENDIX G


APPENDIX H


APPENDIX I
CODING SHEET FOR ART THERAPY ASSESSMENT AND RATING INSTRUMENT STUDIES
DONNA BETTS 2005

Reviewer/Coder: _____________________________________________ Coder ID: ________
Today's Date: __ __/__ __/__ __

Citation Information:
Citation ID #: __________
Author(s): _______________________________________________________________
Major discipline of the first or primary author (1=Art Therapy; 2=Psychology; 3=Counseling; 4=Social Work; 5=Other; 99=unable to determine): _______
Title: ________________________________________________________________________
________________________________________________________________________
Published Study
    Publication: ________________________________________________________________________
    Year: ________ Volume: ________ Issue: ________ Pages: ________
Unpublished Study
    Master's Thesis Year: ________
    Doctoral Dissertation Year: ________
    Other: __________ Year: ________
Search Method:
    1. Electronic search: ____________
    2. Manual search: ______________
    3. Personal recommendation: ____________
    4. Other: ____________________________

Research Design:
Art therapy assessment instrument used in study:
    1. Diagnostic Drawing Series (DDS)
    2. Person Picking an Apple From a Tree (PPAT)
    3. Other: ______________________________________________
Type of rating system used to rate artwork in study:
    1. Drawing Analysis Form (corresponds to DDS)
    2. Formal Elements Art Therapy Scale (FEATS) (corresponds to PPAT)
    3. Other: ________________________________________________________
    4. Other: ________________________________________________________
    99. Unable to determine

Rating Procedures:
Number of raters used to rate drawings: 1. 2. 3. 4. 99. Unknown
Was the primary researcher one of the raters? 1. Yes 2. No 99. Unable to determine
Were raters trained?
    Rater 1: 1. Yes 2. No 99. Unable to determine
    Rater 2: 1. Yes 2. No 99. Unable to determine
    Rater 3: 1. Yes 2. No 99. Unable to determine

Design type:
    Concurrent validity attempted
    Concurrent validity not attempted
Test(s) used in comparison to the art therapy assessment tool: ____________________________________________________________
    1. Reported 2. Not reported 99. Unable to determine
    Statistical measure: ____________ Statistical value: ___________
Criterion validity:
    Were the patients'/subjects' drawings compared to a 2nd set of drawings? 1. Yes 2. No 99. Unable to determine
    If yes: Normal controls / Archival drawings
    Were conclusions about the initial drawings backed up by a psychiatrist's or physician's diagnosis of the patient? 1. Yes 2. No 99. Unable to determine
Predictive validity:
    1. The assessment was developed to predict future behavior (predictive validity)
       What was the researcher attempting to predict? (e.g., assessment predicted suicidality.) ______________________________________________________
    2. The assessment was developed to assess current pathology
       What was the researcher attempting to assess? ______________________________________________________
Intra-rater reliability: 1. Reported 2. Not reported 99. Unable to determine
    Statistical measure: ____________ Statistical value: ___________
Inter-rater reliability: 1. Reported 2. Not reported 99. Unable to determine
    Statistical measure: ____________ Statistical value: ___________
Test-retest reliability attempted
Test-retest reliability not attempted
    Statistical measure: ____________ Statistical value: ___________
    Length of time between testing: _________
Equivalent forms reliability (as with the MMPI, where the order of questions in the test is changed): 1. Reported 2. Not reported 99. Unable to determine
    Statistical measure: ____________ Statistical value: ___________
    Describe the difference between the two forms of the test: __________________________________________________________________
    __________________________________________________________________
Split-half reliability: 1. Reported 2. Not reported 99. Unable to determine


    Statistical measure: ____________ Statistical value: ___________
    What part of the assessment was compared to the other part? __________________________________________________________________

Site Variables:
Location (use state abbreviations): ______ 99. Unknown
Source of funding: 1. Public 2. Private 99. Unknown
Setting: 1. Hospital 2. School 3. Day Treatment Center 4. Other: ________________________________ 5. Unknown

Participant Variables:
# of sites: ______ List sites: _____________________________________________

Patient Group:
# of participants in tx group: ______
    1. Random sample 2. Convenience sample 99. Unable to determine
Diagnostic selection criteria: Ss screened for pathology 1. Yes 2. No 99. Unable to determine
Gender: # Female: ______ # Male: ______
Age: Average age (F): ______ Average age (M): ______ Total average age: ______
     Lowest included age: ______ Highest included age: ______
Socioeconomic status:
    1. Lower # or %: ______
    2. Middle # or %: ______
    3. Mixed # or %: ______
    3. Other # or %: ______
    99. Unknown
Ethnic group:
    1. White # or %: ______
    2. Black # or %: ______
    3. Other ___________________ # or %: ______
    99. Unknown

Control Group (if applicable):
# of participants in control group: ______
    1. Random sample 2. Convenience sample 99. Unable to determine
Diagnostic selection criteria: Ss screened for absence of pathology 1. Yes 2. No 99. Unable to determine
Gender: # Female: ______ # Male: ______
Age: Average age (F): ______ Average age (M): ______ Total average age: ______
     Lowest included age: ______ Highest included age: ______
Socioeconomic status:
    1. Lower # or %: ______
    2. Middle # or %: ______
    3. Mixed # or %: ______
    3. Other # or %: ______
    99. Unknown
Ethnic group:
    1. White # or %: ______
    2. Black # or %: ______
    3. Other ___________________ # or %: ______
    99. Unknown

Statistical Information:
_______________________________________________________________________
_______________________________________________________________________

Study Quality:
Results Favor Use of This Assessment (according to study's author): 1. Yes 2. No 3. Neutral
Study flaws described by author (please list): (include page #)
_______________________________________________________________________
_______________________________________________________________________
Results Favor Use of This Assessment (according to you, the coder): 1. Yes 2. No 3. Neutral
Additional study flaws not identified by author (please list):
_______________________________________________________________________
_______________________________________________________________________

Notes and Comments:
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________


APPENDIX J
PRIMARY STUDIES USED IN THE SYSTEMATIC ANALYSIS
Total Number of Studies: 35

CITATION, followed by LOCATION METHOD

Batza Morris, M. (1995). The Diagnostic Drawing Series and the Tree Rating Scale: An isomorphic representation of multiple personality disorder, major depression, and schizophrenia populations. Art Therapy: Journal of the American Art Therapy Association, 12(2), 118-128.
    Location method: PsycINFO database

Bergland, C., & Moore Gonzalez, R. (1993). Art and madness: Can the interface be quantified? The Sheppard Pratt Art Rating Scale--an instrument for measuring art integration. American Journal of Art Therapy, 31, 81-90.
    Location method: PsycINFO database

Billingsley, G. (1998). The efficacy of the Diagnostic Drawing Series with substance-related disordered clients. Unpublished doctoral dissertation, Walden University.
    Location method: DDS archives

Brudenell, T. J. (1989). Art representations as functions of depressive state: Longitudinal studies in chronic childhood and adolescent depression. Unpublished master's thesis, College of Notre Dame, Belmont, CA.
    Location method: DDS archives

Coffey, T. M. (1997). The use of the Diagnostic Drawing Series with an adolescent population. Unpublished paper.
    Location method: DDS archives

Cohen, B. M., Hammer, J. S., & Singer, S. (1988). The Diagnostic Drawing Series: A systematic approach to art therapy evaluation and research. Arts in Psychotherapy: Special Research in the creative arts therapies, 15(1), 11-21.
    Location method: PsycINFO database

Cohen, B. M., & Heijtmajer, O. (1995). Identification of dissociative disorders: Comparing the SCID-D and dissociative experiences scale with the Diagnostic Drawing Series. Unpublished paper.
    Location method: DDS archives

Couch, J. B. (1994). Diagnostic Drawing Series: Research with older people diagnosed with organic mental syndromes and disorders. Art Therapy: Journal of the American Art Therapy Association, 11(2), 111-115.
    Location method: PsycINFO database

Easterling, C. E. (2000). Art therapy with elementary school students with symptoms of depression: The effects of storytelling, fantasy, daydreams, and art reproductions. Unpublished master's thesis, Florida State University, Tallahassee, FL.
    Location method: Manual search of FSU art therapy department dissertations

Eitel, K., Szkura, L., & Wietersheim, J. v. (2004). Do you see what I see? A study about the interrater-reliability in art therapy. Unpublished paper.
    Location method: Manual search: know the author personally

Fowler, J. P., & Ardon, A. M. (2002). Diagnostic Drawing Series and dissociative disorders: A Dutch study. Arts in Psychotherapy, 29(4), 221-230.
    Location method: PsycINFO database

Francis, D., Kaiser, D., & Deaver, S. P. (2003). Representations of attachment security in the bird's nest drawings of clients with substance abuse disorders. Art Therapy: Journal of the American Art Therapy Association, 20(3), 125-137.
    Location method: PsycINFO database

Gantt, L. M. (1990). A validity study of the Formal Elements Art Therapy Scale (FEATS) for diagnostic information in patients' drawings. Unpublished dissertation, University of Pittsburgh, Pittsburgh, PA.
    Location method: Manual search: know the author personally

Gulbro Leavitt, C. (1988). A validity study of the Diagnostic Drawing Series as used for assessing depression in children and adolescents. Unpublished doctoral dissertation, California School of Professional Psychology, Los Angeles, CA.
    Location method: DDS archives

Gulbro-Leavitt, C., & Schimmel, B. (1991). Assessing depression in children and adolescents using the Diagnostic Drawing Series modified for children (DDS-C). Arts in Psychotherapy, 18(4), 353-356.
    Location method: PsycINFO database

Gussak, D. (2004). Art therapy with prison inmates: A pilot study. Arts in Psychotherapy, 31(4), 245-259.
    Location method: PsycINFO database

Hacking, S., Foreman, D., & Belcher, J. (1996). The Descriptive Assessment for Psychiatric Art: A new way of quantifying paintings by psychiatric patients. Journal of Nervous and Mental Disease, 184(7), 425-430.
    Location method: PsycINFO database

Hacking, S., & Foreman, D. (2000). The Descriptive Assessment for Psychiatric Art (DAPA): Update and further research. Journal of Nervous and Mental Disease, 188(8), 525-529.
    Location method: PsycINFO database

Hays, R. E., & Lyons, S. J. (1981). The bridge drawing: A projective technique for assessment in art therapy. Arts in Psychotherapy, 8(3-4), 207-217.
    Location method: PsycINFO database

Hyler, C. (2002). Children's drawings as representations of attachment. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
    Location method: Obtained from EVMS

Johnson, K. M. (2004). The use of the Diagnostic Drawing Series in the diagnosis of bipolar disorder. Unpublished dissertation, Seattle Pacific University, Seattle, WA.
    Location method: DDS archives

Kaiser, D. (1993). Attachment organization as manifested in a drawing task. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
    Location method: Obtained from EVMS

Kessler, K. (1994). A study of the Diagnostic Drawing Series with eating disordered patients. Art Therapy: Journal of the American Art Therapy Association, 11(2), 116-118.
    Location method: PsycINFO database

Kress, T., & Mills, A. (1992). Multiple personality disorder and the Diagnostic Drawing Series: Further investigations. Unpublished paper.
    Location method: DDS archives

Manning, T. M. (1987). Aggression depicted in abused children's drawings. Arts in Psychotherapy, 14, 15-24.
    Location method: PsycINFO database

McHugh, C. M. (1997). A comparative study of structural aspects of drawings between individuals diagnosed with major depressive disorder and bipolar disorder in the manic phase. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
    Location method: DDS archives

Mills, A. (1989). A statistical study of the formal aspects of the Diagnostic Drawing Series of borderline personality disordered patients, and its context in contemporary art therapy. Unpublished master's thesis, Concordia University, Montreal, PQ.
    Location method: DDS archives

Mills, A., & Cohen, B. (1993). Facilitating the identification of multiple personality disorder through art: The Diagnostic Drawing Series. In E. S. Kluft (Ed.), Expressive and functional therapies in the treatment of multiple personality disorder. Springfield, IL: Charles C Thomas.
    Location method: PsycINFO database

Neale, E. L. (1994). The Children's Diagnostic Drawing Series. Art Therapy: Journal of the American Art Therapy Association, 11(2), 119-126.
    Location method: PsycINFO database

Overbeck, L. B. (2002). A pilot study of pregnant women's drawings. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
    Location method: Obtained from EVMS

Ricca, D. (1992). Utilizing the Diagnostic Drawing Series as a tool in differentiating a diagnosis between multiple personality disorder and schizophrenia. Unpublished master's thesis, Hahnemann University, Philadelphia, PA.
    Location method: DDS archives

Shlagman, H. (1996). The Diagnostic Drawing Series: A comparison of psychiatric inpatient adolescents in crisis with nonhospitalized youth. Unpublished master's thesis, College of Notre Dame, Belmont, CA.
    Location method: DDS archives

Wadlington, W. L., & McWhinnie, H. J. (1973). The development of a rating scale for the study of formal aesthetic qualities in the paintings of mental patients. Art Psychotherapy, 1, 201-220.
    Location method: PsycINFO database

Wilson, K. (2004). Projective drawing: Alternative assessment of emotion in children who stutter. Unpublished bachelor's thesis, Florida State University, Tallahassee, FL.
    Location method: Manual search: know the author personally

Yahnke, L. (2000). Diagnostic Drawing Series as an assessment for children who have witnessed marital violence. Unpublished doctoral dissertation, Minnesota School of Professional Psychology, Minneapolis, MN.
    Location method: PsycINFO database


APPENDIX K SUMMARY OF NUMERICAL DATA ON SPECIFIC DDS DRAWING ANALYSIS FORM (DAF) VARIABLES
DDS Citation ID Billingsley 1998 DDS DAF Variable information Percentage of incidence in all three of the DDS pictures; Chi-square (to address the research question How are the graphic profiles of substance-related disordered clients admitted to an outpatient treatment program different from the profiles already generated by the (DDS) with a control group and other disorders? (p. 6). All three of the DDS drawings combined and correlated with CES-D scores. Patient group compared to control group (percentages) Reported correlations, specifically, the degree of association of all the explanatory variables with the diagnostic category Reported percentages of occurrence. Discrimination measures reported. Reported Frequencies of structural elements across drawings which are significant or approaching significance. Variable Idiosyncratic Color reported for Free Drawing only, correlated to DSRS, phi=.37; variable People reported for Free Drawing only, correlated to DSRS, phi=.34; variable Animals reported for Free Drawing only, correlated to DSRS, phi=.24. Summary of significant variables. Correlations of MCMI-II categories and DDS variables on all 3 DDS pictures. Percentage of occurrence/non-occurrence of Groundline element in picture one of the DAF; Percentage of occurrence/non-occurrence of Falling Apart Tree elements in picture two of the DAF. Inter-rater reliability statistics. Tabulation of BPD Sample (percentages) Tabulation of Character-Disordered Males (percentages); Tabulation of Character-Disordered Adolescents (percentages). Statistically Significant Regressors (T-tests). 
Six DAF variables appeared prominently in each of the three pictures by the MPD subjects and were reported as percentages of incidence: Integration, Abstraction, Enclosure, Movement, Tree, Tilt Probability Table for Ho1, that the variables of the DAF applied to the CDDS would significantly discriminate between the treatment group (DSM-III-R diagnosis) and the Page #s in study where information is located pp. 85-93 pp. 155-156 Table(s) containing information Table 4

Brudenell 1989 Coffey 1997 Cohen 1988 Couch 1994 Fowler 2002 Gulbro 1988 Gulbro 1991

p. 43 pp. 21-24 pp. 16-19 pp. 112-113 p. 226 p. 105 p. 354

Table 13 Tables 1-4

Table 5 Table 25

Johnson 2004-a Johnson 2004-b Kessler 1994

p. 13

Table 4 Tables 10-12 Table 2 Table 3

p. 117

Mchugh 1997 Mills 1989-a

pp. 86-87 pp. 86-89 pp. 90-93, pp. 94-98 pp. 110-119

Tables 1 & 2 Table 5 Table 6 Table 7 Table 8

Mills 1989-b Mills 1993

Neale 1994-a

p. 47 p. 48 p. 49 p. 46 p. 46 p. 47 p. 121

Table 4 Table 6 Table 7 Table 2 Table 3 Table 5 Table 5


Neale 1994-b Neale 1994-c Ricca 1992-a Ricca 1992-b

Shlagman 1996-a Shlagman 1996-b Yahnke 2000

control group. Chi-squares for Ho2, that a cluster of DAF variables applied to the CDDS would emerge as criteria for diagnosing children with adjustment disorders. Characteristics of Drawings of Children Diagnosed with Adjustment Disorder: Variables with interrater reliability with Kappa > .50. The statistical results from the Likelihood Ratio Chi-Square test, presenting the collaborated scores of raters one, two and three; p-values provided. Cross-analysis of Picture 1s, Picture 2s, Picture 3s [The statistical results from the Likelihood Ratio Chi-Square test, presenting the collaborated scores of raters one, two and three; p-values provided.] DAF variables data. Variables with 25% (or more) difference between both sample groups. Inter-rater reliability between variables on the DDS; Frequency results from the CDDS Series total; Variables from the CDDS and their relationship to Primary Scales on the CBCL

p. 123 p. 124 pp. 66-70 pp. 72-74

Table 6 Table 7 Tables 3-5 Tables 6-8

pp. 35-93 pp. 67-69 p. 30 p. 32 p. 34

Tables 7-67 Tables 63-65 Table 4 Table 5 Table 6


APPENDIX L META-ANALYSIS OF INTER-RATER RELIABILITY GLOBAL EFFECT SIZE OF NINE STUDIES THAT USED CORRELATIONS
Obs   K   Q         P            LL_T_DOT   T_DOT     UL_T_DOT
1     9   42.7370   .000000984   1.39667    1.49870   1.60074

Key: K = no. of studies; Q = effect size (represents variability among the validities); LL_T_DOT = lower (CI); T_DOT = Fisher Z; UL_T_DOT = upper (CI).

Obs   V_T_DOT      SE_T_DOT   MODV      modsd     QV        qsd
1     .002710027   0.052058   0.10591   0.32543   0.11191   0.33453

Obs   LbackZ    BackZ     UbackZ
1     0.88463   0.90491   0.92178

Fisher's Z: global effect size.

Confidence Intervals (these provide a method of visually detecting heterogeneity)

LLIM   ULIM   z (Fisher)
0.63   1.51   1.07
1.30   1.75   1.53
1.87   2.50   2.18
1.25   1.70   1.47
1.01   1.52   1.27
1.28   1.78   1.53
1.22   2.97   2.09
1.88   3.64   2.76
0.57   1.33   0.95

(Interval plot omitted; plot axis ran from min -0.5 to max 4.)
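The back-transformed values in this appendix can be reproduced from the Fisher Z estimate and its standard error. The following is a minimal sketch, assuming the interval was built with the normal critical value 1.96 (the output does not state the multiplier); the function name `fisher_ci_to_r` is hypothetical and not part of the original analysis.

```python
import math

def fisher_ci_to_r(t_dot, se, crit=1.96):
    """Return (lower, point, upper) on the correlation scale.

    t_dot : pooled effect on the Fisher Z scale
    se    : standard error of t_dot
    crit  : normal critical value (1.96 is an assumption for a 95% CI)
    """
    lower_z = t_dot - crit * se
    upper_z = t_dot + crit * se
    # tanh is the inverse of the Fisher Z transform
    return math.tanh(lower_z), math.tanh(t_dot), math.tanh(upper_z)

# T_DOT and SE_T_DOT as reported for the nine studies
low, point, high = fisher_ci_to_r(1.49870, 0.052058)
print(round(low, 5), round(point, 5), round(high, 5))
```

Under these assumptions the three printed values should agree closely with the LbackZ, BackZ, and UbackZ entries listed above.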


APPENDIX M META-ANALYSIS OF INTER-RATER RELIABILITY GLOBAL EFFECT SIZE OF SEVEN STUDIES THAT USED CORRELATIONS

Obs   K   Q         P            LL_T_DOT   T_DOT     UL_T_DOT
1     7   32.7977   .000011469   1.36944    1.47289   1.57633

Obs   V_T_DOT      SE_T_DOT   MODV       modsd     QV         qsd
1     .002785515   0.052778   0.087086   0.29510   0.089650   0.29942

Obs   LBackZ    BackZ     UBackZ
1     0.87857   0.90013   0.91803

BackZ: global effect size for inter-rater reliability correlations.

Confidence Intervals (these provide a method of visually detecting heterogeneity)

LLIM   ULIM   z (Fisher)
1.28   1.78   1.53
1.30   1.75   1.53
1.87   2.50   2.18
0.63   1.51   1.07
0.57   1.33   0.95
1.25   1.70   1.47
1.01   1.52   1.27

(Interval plot omitted; plot axis ran from min -0.5 to max 4.)
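As an illustration of how a fixed-effects global estimate relates to the per-study intervals, the sketch below recovers each study's standard error from the width of its interval, weights each Fisher z by the inverse variance, and computes the pooled estimate and the homogeneity statistic Q. This is an assumption-laden reconstruction, not the original SAS program: it assumes 95% intervals (critical value 1.96) and works from the two-decimal rounded values listed in this appendix, so it only approximates the reported T_DOT, V_T_DOT, and Q. The function name `pool_fixed` is hypothetical.

```python
# (LLIM, ULIM, z) for the seven studies, as listed in this appendix
studies = [
    (1.28, 1.78, 1.53),
    (1.30, 1.75, 1.53),
    (1.87, 2.50, 2.18),
    (0.63, 1.51, 1.07),
    (0.57, 1.33, 0.95),
    (1.25, 1.70, 1.47),
    (1.01, 1.52, 1.27),
]

def pool_fixed(rows, crit=1.96):
    """Inverse-variance (fixed-effects) pooling of Fisher z values."""
    weights, zs = [], []
    for lo, hi, z in rows:
        se = (hi - lo) / (2 * crit)  # SE recovered from CI width (assumption)
        weights.append(1.0 / se ** 2)
        zs.append(z)
    total_w = sum(weights)
    pooled = sum(w * z for w, z in zip(weights, zs)) / total_w
    # Q: weighted squared deviations of each study from the pooled value
    q = sum(w * (z - pooled) ** 2 for w, z in zip(weights, zs))
    return pooled, 1.0 / total_w, q

pooled, variance, q_stat = pool_fixed(studies)
print(round(pooled, 3), round(variance, 5), round(q_stat, 2))
```

Because the inputs are rounded, small differences from the tabled values are expected.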


APPENDIX N META-ANALYSIS OF INTER-RATER RELIABILITY POTENTIAL MEDIATING VARIABLE: RATER TRAINING VS. NO RATER TRAINING

Obs   Ratertrng                            K   Q         P          LL_T_DOT   T_DOT     UL_T_DOT   V_T_DOT
1     0  Some or all raters not trained    5   9.1174    0.058230   1.29672    1.40970   1.52267    0.003322
2        Rater(s) trained                  4   20.6794   0.000123   1.65501    1.89269   2.13038    0.014706

Obs 1: 5 studies, raters not trained. Studies that did not train raters have almost equal reliabilities.

Obs   SE_T_DOT   MODV      modsd     QV        qsd
1     0.05764    0.02125   0.14578   0.02173   0.14742
2     0.12127    0.34665   0.58877   0.44037   0.66360

Obs   LBackZ    BackZ     UBackZ
1     0.86088   0.88743   0.90916
2     0.92954   0.95561   0.97217

Obs 1: global effect size in studies that did not train all raters.
Obs 2: global effect size in studies that did train all raters (higher than studies that did not).

Obs   Q         QBetween   Pb           Qwithin   Pw
1     42.7370   12.9402    .000321600   29.7968   .000103462

Q: variation among all reliabilities of the 9 studies.
QBetween: with rater training factor, this number explains the variability more than w/o this information.
Qwithin: there is still 29.79 variability that cannot be explained by rater training.
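The partition reported in this appendix can be checked by hand: the within-group Q values add up to Qwithin, and Q minus Qwithin gives QBetween. The sketch below does this arithmetic and evaluates chi-square tail probabilities with a small standard-library helper. The degrees of freedom (groups minus 1 for QBetween, studies minus groups for Qwithin) are the conventional choice and are an assumption here, since the output does not print them; `chi2_sf` is a hypothetical helper, not a library function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1),
    starting from Q(1/2, y) = erfc(sqrt(y)) or Q(1, y) = exp(-y).
    """
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

q_total = 42.7370                      # Q for all nine studies
q_within_groups = [9.1174, 20.6794]    # Q within each rater-training group
k_total, n_groups = 9, 2

q_within = sum(q_within_groups)
q_between = q_total - q_within
p_between = chi2_sf(q_between, n_groups - 1)       # df = 1 (assumption)
p_within = chi2_sf(q_within, k_total - n_groups)   # df = 7 (assumption)

print(round(q_between, 4), round(q_within, 4))
print(p_between, p_within)
```

With these degrees of freedom the tail probabilities should land close to the tabled Pb and Pw.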


APPENDIX O META-ANALYSIS OF INTER-RATER RELIABILITY POTENTIAL MEDIATING VARIABLE: AUTHOR RATER VS. AUTHOR NOT RATER

Obs   AuthorRater                          K   Q         P          LL_T_DOT   T_DOT     UL_T_DOT   V_T_DOT
1     0  Primary Author was not a rater    6   32.5147   0.000005   1.34215    1.45846   1.57476    0.003521
2     1  Primary Author was a rater        3   8.2254    0.016363   1.42058    1.63317   1.84576    0.011765

Obs   SE_T_DOT   MODV      modsd     QV        qsd
1     0.05934    0.11626   0.34097   0.12076   0.34750
2     0.10847    0.10986   0.33145   0.34140   0.58429

Obs   LBackZ    BackZ     UBackZ
1     0.87219   0.89735   0.91778
2     0.88972   0.92651   0.95135

Reliability is a bit higher when author is rater.

Obs   Q         QBetween   Pb        Qwithin   Pw
1     42.7370   1.99688    0.15762   40.7402   .000000908

QBetween: this number is low, which means that whether the author is rater or not does not help explain the variability among the 9 studies.


APPENDIX P META-ANALYSIS OF INTER-RATER RELIABILITY POTENTIAL MEDIATING VARIABLE: CODER FOUND STUDY SUPPORTED TEST VS. NEUTRAL ON WHETHER STUDY SUPPORTED TEST

Obs   CodrFav                                            K   Q         P          LL_T_DOT   T_DOT     UL_T_DOT   V_T_DOT
1     1  Coder found study supported test                4   18.6083   0.000329   1.55438    1.70047   1.84656    .005555556
2     2  Coder neutral on whether study supported test       9.8227    0.043522   1.16398    1.30655   1.44912    .005291005

Obs   SE_T_DOT   MODV      modsd     QV        qsd
1     0.074536   0.11562   0.34003   0.13091   0.36181
2     0.072739   0.03851   0.19624   0.04395   0.20965

Obs   LBackZ    BackZ     UBackZ
1     0.91450   0.93547   0.95142
2     0.82233   0.86340   0.89552

Obs 1: coder found study supported test gives higher reliability.
Obs 2: coder neutral, so reliability goes down a bit.

Obs   Q         QBetween   Pb           Qwithin   Pw
1     42.7370   14.3060    .000155369   28.4310   .000183652

QBetween: 14/42 means almost 1/3 of variability is explained by this factor (but 2/3 is still not explained).
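The "almost 1/3" reading of QBetween is simple arithmetic on the tabled values. A short sketch, using only the Q, QBetween, and Qwithin figures reported in this appendix (the variable names are illustrative):

```python
q_total = 42.7370    # Q for all nine studies
q_between = 14.3060  # variability attributed to the coder-favor grouping
q_within = 28.4310   # variability left within the groups

share_between = q_between / q_total
share_within = q_within / q_total

print(f"between groups: {share_between:.1%}")
print(f"within groups:  {share_within:.1%}")
```

The two shares sum to one because Q is the sum of QBetween and Qwithin.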


APPENDIX Q META-ANALYSIS OF INTER-RATER RELIABILITY POTENTIAL MEDIATING VARIABLE: STUDY PUBLISHED VS. STUDY NOT PUBLISHED

Obs   Publish                   K   Q         P            LL_T_DOT   T_DOT     UL_T_DOT   V_T_DOT
1     0  Study not published    5   24.8186   .000054715   1.39330    1.51258   1.63186    0.003704
2     1  Study published            17.7246   .000501290   1.26387    1.46085   1.65784    0.010101

Key: K = no. of studies; Q = effect size (represents variability among the validities); LL_T_DOT = lower (CI); T_DOT = Fisher Z; UL_T_DOT = upper (CI).

Obs   SE_T_DOT   MODV      modsd     QV        qsd
1     0.06086    0.09638   0.31046   0.10055   0.31710
2     0.10050    0.19831   0.44532   0.28152   0.53059

Obs   LBackZ    BackZ     UBackZ
1     0.88389   0.90740   0.92633
2     0.85213   0.89782   0.92993

Obs   Q         QBetween   Pb        Qwithin   Pw
1     42.7370   0.19383    0.65975   42.5432   .000000409

QBetween: whether the study is published or not doesn't tell us very much about the variability.


APPENDIX R CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS AUTHOR FAVOR OVERALL ANALYSIS AND DESCRIPTIVES Concurrent Validity: Confidence Intervals for Author Favor
LLIM    ULIM   z (Fisher)
-0.00   1.96   0.98
-0.99   0.76   -0.12
-0.36   0.16   -0.10
0.38    1.01   0.69
-0.21   0.43   0.11
-0.22   0.22   0.00
-0.21   0.23   0.01
-0.10   0.34   0.12
-0.38   0.06   -0.16
0.09    0.53   0.31
-0.34   0.39   0.03

(Interval plot omitted; plot axis ran from min -1 to max 1.5.)

Concurrent Validity: Confidence Intervals for Author Favor: estimated (real) validity

LL_T_DOT   T_DOT   UL_T_DOT
0.01       0.09    0.17

(Interval plot omitted; plot axis ran from min -0.2 to max 0.8.)


APPENDIX S CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS AUTHOR FAVOR OVERALL META-ANALYSIS, ALL STUDIES

Obs   K    Q         P           LL_T_DOT     T_DOT      UL_T_DOT   V_T_DOT      SE_T_DOT
1     11   29.4230   .00106392   .008025498   0.090049   0.17207    .001751313   0.041849

Key: K = no. of studies; Q = effect size (represents variability among the validities); LL_T_DOT = lower (CI); T_DOT = Fisher Z; UL_T_DOT = upper (CI); V_T_DOT = variance; SE_T_DOT = standard error.

Obs   MODV       modsd     QV         qsd
1     0.037417   0.19344   0.038640   0.19657

Correlation:
Obs   LBackZ        BackZ      UBackZ
1     .008025325    0.089806   0.17039

Key: LBackZ = lower bound; BackZ = correlation; UBackZ = upper bound.


APPENDIX T CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS AUTHOR FAVOR MODEL BY SAMPLING, ALL STUDIES

Obs   AuthrFav   K   Q         P         LL_T_DOT   T_DOT     UL_T_DOT   V_T_DOT
1     0          2   2.6488    0.10363   -0.28463   0.36871   1.02204    0.11111
2     1          3   14.5153   0.00070   0.01860    0.18855   0.35850    0.00752
3     2          6   9.7017    0.08414   -0.04096   0.05367   0.14829    0.00233

Obs   SE_T_DOT   MODV      modsd     QV        qsd
1     0.33333    0.36640   0.60531   0.37098   0.60908
2     0.08671    0.14115   0.37570   0.14409   0.37959
3     0.04828    0.01315   0.11468   0.01334   0.11550

Correlation:
Obs   LBackZ     BackZ     UBackZ
1     -0.27718   0.35286   0.77070
2     0.01860    0.18635   0.34390
3     -0.04094   0.05361   0.14722

Obs 1: author found that study did not support test: highest reliability.
Obs 2: author found that study did support test: lower reliability.
Obs 3: author neutral: lowest reliability.

The possible variability among the three groups:
Obs   Q         QBetween   Pb        Qwithin   Pw
1     29.4230   2.55721    0.27843   26.8657   .000745726

Q: variation among reliabilities of all 11 studies.
Pb: indicates that the three groups are homogeneous (i.e., not significant, based on Q between).
Qwithin: there is still 26.8657 variability that cannot be explained by Author Favor.
Pw: Q-within is significant because Pw<.001. Indicates that the variability among the effect sizes can contribute to the Q-within variabilities.


APPENDIX U CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS CODER FAVOR OVERALL ANALYSIS AND DESCRIPTIVES Concurrent Validity: Confidence Intervals for Coder Favor
LLIM    ULIM   z (Fisher)
-0.00   1.96   0.98
-0.36   0.16   -0.10
0.38    1.01   0.69
-0.21   0.43   0.11
-0.22   0.22   0.00
-0.21   0.23   0.01
-0.10   0.34   0.12
-0.38   0.06   -0.16
0.09    0.53   0.31
-0.34   0.39   0.03
-0.99   0.76   -0.12

(Interval plot omitted; plot axis ran from min -1 to max 1.5.)

Concurrent Validity: Confidence Intervals for Coder Favor: estimated (real) validity

LL_T_DOT   T_DOT   UL_T_DOT
0.01       0.09    0.17

(Interval plot omitted; plot axis ran from min -0.2 to max 0.8.)


APPENDIX V CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS CODER FAVOR OVERALL META-ANALYSIS, ALL STUDIES

Obs   K    Q         P           LL_T_DOT     T_DOT      UL_T_DOT   V_T_DOT      SE_T_DOT
1     11   29.4230   .00106392   .008025498   0.090049   0.17207    .001751313   0.041849

Key: K = no. of studies; Q = effect size (represents variability among the validities); LL_T_DOT = lower (CI); T_DOT = Fisher Z; UL_T_DOT = upper (CI); V_T_DOT = variance; SE_T_DOT = standard error.

Obs   MODV       modsd     QV         qsd
1     0.037417   0.19344   0.038640   0.19657

Correlation:
Obs   LBackZ        BackZ      UBackZ
1     .008025325    0.089806   0.17039

Key: LBackZ = lower bound; BackZ = correlation; UBackZ = upper bound.


APPENDIX W CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS CODER FAVOR MODEL BY SAMPLING, ALL STUDIES

Obs   CodrFav   K   Q         P         LL_T_DOT    T_DOT     UL_T_DOT   V_T_DOT
1     0         1   0.0000    .         -0.004755   0.97524   1.95524    0.25000
2     1         3   14.5153   0.00070   0.018598    0.18855   0.35850    0.00752
3     2         7   9.8448    0.13134   -0.042379   0.05170   0.14579    0.00230

Key: K = no. of studies; Q = effect size (represents variability among the validities); LL_T_DOT = lower (CI); T_DOT = Fisher Z; UL_T_DOT = upper (CI).

NB: The first line (0) can be ignored because there was only one study, and it shouldn't affect the overall data.

Obs   SE_T_DOT   MODV      modsd     QV        qsd
1     0.50000    0.00000   0.00000   0.00000   0.00000
2     0.08671    0.14115   0.37570   0.14409   0.37959
3     0.04800    0.01034   0.10166   0.01073   0.10359

Correlation:
Obs   LBackZ      BackZ     UBackZ
1     -0.004755   0.75100   0.96073
2     0.018595    0.18635   0.34390
3     -0.042353   0.05166   0.14476

The possible variability among the three groups:
Obs   Q         QBetween   Pb         Qwithin   Pw
1     29.4230   5.06286    0.079545   24.3601   .001993794

Q: variation among reliabilities of all 11 studies.
Pb: indicates that the three groups are homogeneous (i.e., not significant, based on Q between).
Qwithin: there is still 24.3601 variability that cannot be explained by Coder Favor.
Pw: Q-within is not significant because Pw>.001. Indicates that the variability among the effect sizes does not contribute significantly to the Q-within variabilities.

NB: Most of the variability among the effect sizes comes from the Q within.


APPENDIX X CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS MANUSCRIPT TYPE OVERALL ANALYSIS AND DESCRIPTIVES Concurrent Validity: Confidence Intervals for Manuscript Type
LLIM    ULIM   z (Fisher)
-0.22   0.22   0.00
-0.21   0.23   0.01
-0.10   0.34   0.12
-0.38   0.06   -0.16
0.09    0.53   0.31
-0.36   0.16   -0.10
-0.00   1.96   0.98
0.38    1.01   0.69
-0.21   0.43   0.11
-0.34   0.39   0.03
-0.99   0.76   -0.12

(Interval plot omitted; plot axis ran from min -1 to max 1.5.)

Concurrent Validity: Confidence Intervals for Manuscript Type: estimated (real) validity

LL_T_DOT   T_DOT   UL_T_DOT
0.01       0.09    0.17

(Interval plot omitted; plot axis ran from min -0.2 to max 0.8.)


APPENDIX Y CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS MANUSCRIPT TYPE OVERALL META-ANALYSIS, ALL STUDIES

Obs   K    Q         P           LL_T_DOT     T_DOT      UL_T_DOT   V_T_DOT      SE_T_DOT
1     11   29.4230   .00106392   .008025498   0.090049   0.17207    .001751313   0.041849

Key: K = no. of studies; Q = effect size (represents variability among the validities); LL_T_DOT = lower (CI); T_DOT = Fisher Z; UL_T_DOT = upper (CI); V_T_DOT = variance; SE_T_DOT = standard error.

Obs   MODV       modsd     QV         qsd
1     0.037417   0.19344   0.038640   0.19657

Correlation:
Obs   LBackZ        BackZ      UBackZ
1     .008025325    0.089806   0.17039

Key: LBackZ = lower bound; BackZ = correlation; UBackZ = upper bound.


APPENDIX Z CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS MANUSCRIPT TYPE MODEL BY SAMPLING, ALL STUDIES

Obs   manutyp   K   Q         P          LL_T_DOT   T_DOT     UL_T_DOT   V_T_DOT
1     1         6   10.8215   0.055036   -0.05484   0.03684   0.12853    .002188184
2     2         5   12.1219   0.016468   0.11976    0.30333   0.48690    .008771930

(1) Journal articles and dissertations: low validity
(2) Master's theses: high validity

Obs   SE_T_DOT   MODV       modsd     QV        qsd
1     0.046778   0.015286   0.12364   0.01533   0.12379
2     0.093659   0.089056   0.29842   0.10036   0.31679

Obs   LBackZ     BackZ     UBackZ
1     -0.05479   0.03683   0.12783
2     0.11919    0.29436   0.45176

The possible variability among the two groups:
Obs   Q         QBetween   Pb         Qwithin   Pw
1     29.4230   6.47957    0.010912   22.9434   .000745726

Q: variation among reliabilities of all 11 studies.
Pb: indicates that the two groups are homogeneous (i.e., not significant, based on Q between).
Qwithin: there is still 22.9434 variability that cannot be explained by Manuscript Type.
Pw: Q-within is significant because Pw<.001. Indicates that the variability among the effect sizes can contribute to the Q-within variabilities.


APPENDIX AA CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS PATIENT GROUP OVERALL ANALYSIS AND DESCRIPTIVES Concurrent Validity: Confidence Intervals for Patient Group
LLIM    ULIM   z (Fisher)
-0.00   1.96   0.98
-0.22   0.22   0.00
-0.21   0.23   0.01
-0.10   0.34   0.12
-0.38   0.06   -0.16
0.09    0.53   0.31
-0.36   0.16   -0.10
0.38    1.01   0.69
-0.21   0.43   0.11
-0.34   0.39   0.03
-0.99   0.76   -0.12

(Interval plot omitted; plot axis ran from min -1 to max 1.5.)

Concurrent Validity: Confidence Intervals for Patient Group: estimated (real) validity

LL_T_DOT   T_DOT   UL_T_DOT
0.01       0.09    0.17

(Interval plot omitted; plot axis ran from min -0.2 to max 0.8.)


APPENDIX BB CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS PATIENT GROUP OVERALL META-ANALYSIS, ALL STUDIES

Obs   K    Q         P           LL_T_DOT     T_DOT      UL_T_DOT   V_T_DOT      SE_T_DOT
1     11   29.4230   .00106392   .008025498   0.090049   0.17207    .001751313   0.041849

Key: K = no. of studies; Q = effect size (represents variability among the validities); LL_T_DOT = lower (CI); T_DOT = Fisher Z; UL_T_DOT = upper (CI); V_T_DOT = variance; SE_T_DOT = standard error.

Obs   MODV       modsd     QV         qsd
1     0.037417   0.19344   0.038640   0.19657

Correlation:
Obs   LBackZ        BackZ      UBackZ
1     .008025325    0.089806   0.17039

Key: LBackZ = lower bound; BackZ = correlation; UBackZ = upper bound.


APPENDIX CC CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS PATIENT GROUP MODEL BY SAMPLING, ALL STUDIES

Obs   Pantgrp   K   Q         P          LL_T_DOT    T_DOT     UL_T_DOT   V_T_DOT
1     1         7   14.3133   0.026325   -0.046300   0.04499   0.13627    .002169197
2     3         4   10.2503   0.016554   0.092022    0.27890   0.46578    .009090909

1 = major mental illness: validity fairly low
3 = disorders of childhood: validity fairly high

Obs   SE_T_DOT   MODV       modsd     QV         qsd
1     0.046575   0.021039   0.14505   0.021621   0.14704
2     0.095346   0.087883   0.29645   0.095559   0.30913

Correlation:
Obs   LBackZ      BackZ     UBackZ
1     -0.046267   0.04496   0.13544
2     0.091763    0.27189   0.43478

The possible variability among the groups:
Obs   Q         QBetween   Pb         Qwithin   Pw
1     29.4230   4.85927    0.027498   24.5637   .003493698

Q: variation among reliabilities of all 11 studies.
Pb: indicates that the groups are homogeneous (i.e., not significant, based on Q between).
Qwithin: there is still 24.5637 variability that cannot be explained by Patient Grp.
Pw: Q-within is not significant because Pw>.001. Indicates that the variability among the effect sizes does not contribute significantly to the Q-within variabilities.


APPENDIX DD

CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS ASSESSMENT TYPE
OVERALL ANALYSIS AND DESCRIPTIVES

Concurrent Validity: Confidence Intervals for Assessment Type
[Plot: confidence interval for each study; axis from -1 to 1.5]

Row  LLIM   Fisher z  ULIM
1    -0.00  0.98      1.96
2    -0.22  0.00      0.22
3    -0.21  0.01      0.23
4    -0.10  0.12      0.34
5    0.09   0.31      0.53
6    0.38   0.69      1.01
7    -0.21  0.11      0.43
8    -0.34  0.03      0.39
9    -0.38  -0.16     0.06
10   -0.36  -0.10     0.16
11   -0.99  -0.12     0.76

Concurrent Validity: Confidence Intervals for Assessment Type, estimated (real) validity
[Plot: estimated validity with confidence interval; axis from -0.2 to 0.8]

LL_T_DOT  T_DOT  UL_T_DOT
0.01      0.09   0.17
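The pooled estimate shown above can be approximated from the per-study limits. The Python sketch below assumes each interval is the Fisher z value plus or minus 1.96 standard errors, recovers inverse-variance weights from the rounded limits, and forms the weighted mean, its standard error, and the homogeneity statistic Q. Because the inputs are rounded to two decimals, the results only approximate the reported T_DOT of about 0.09; all names are illustrative rather than taken from the original analysis.

```python
import math

# (LLIM, Fisher z, ULIM) for the 11 studies, in the order listed above
studies = [
    (-0.00, 0.98, 1.96), (-0.22, 0.00, 0.22), (-0.21, 0.01, 0.23),
    (-0.10, 0.12, 0.34), (0.09, 0.31, 0.53), (0.38, 0.69, 1.01),
    (-0.21, 0.11, 0.43), (-0.34, 0.03, 0.39), (-0.38, -0.16, 0.06),
    (-0.36, -0.10, 0.16), (-0.99, -0.12, 0.76),
]

def pooled_fixed_effect(rows, crit=1.96):
    """Inverse-variance weighted mean, its SE, and Q (crit is an assumption)."""
    weights, zs = [], []
    for lower, z, upper in rows:
        se = (upper - lower) / (2 * crit)   # SE recovered from interval width
        weights.append(1 / se ** 2)
        zs.append(z)
    total_w = sum(weights)
    t_dot = sum(w * z for w, z in zip(weights, zs)) / total_w
    se_t_dot = math.sqrt(1 / total_w)
    q = sum(w * (z - t_dot) ** 2 for w, z in zip(weights, zs))
    return t_dot, se_t_dot, q

t_dot, se_t_dot, q = pooled_fixed_effect(studies)
print(round(t_dot, 3), round(se_t_dot, 3), round(q, 1))
```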


APPENDIX EE

CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS ASSESSMENT TYPE
OVERALL META-ANALYSIS, ALL STUDIES

Obs  K   Q        P          LL_T_DOT    T_DOT     UL_T_DOT  V_T_DOT     SE_T_DOT
1    11  29.4230  .00106392  .008025498  0.090049  0.17207   .001751313  0.041849

K = No. of studies; Q represents variability among the validities; LL_T_DOT = Lower (CI); T_DOT = Fisher Z (effect size); UL_T_DOT = Upper (CI); V_T_DOT = Variance; SE_T_DOT = Standard Error.

Obs  MODV      modsd    QV        qsd
1    0.037417  0.19344  0.038640  0.19657

Correlation:

Obs  LBackZ      BackZ     UBackZ
1    .008025325  0.089806  0.17039


APPENDIX FF

CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS ASSESSMENT TYPE
MODEL BY SAMPLING, ALL STUDIES

Obs  Test  K  Q        P        LL_T_DOT  T_DOT     UL_T_DOT  V_T_DOT     SE_T_DOT
1    1     5  7.91905  0.09459  0.01182   0.12071   0.22960   .003086420  0.055556
2    2     3  9.43131  0.00895  0.10645   0.29773   0.48901   .009523810  0.097590
3    3     3  0.14477  0.93017  -0.29795  -0.13347  0.03101   .007042254  0.083918

Test: Recoding of tests into 3 categories:
2 = Depression measures (DSRS, CES-D, CDI-C, CDI-P) (original categories: 3, 6, 7, 8)
3 = Attachment measures (RQ, ATM) (original categories: 4, 12)
4,5,6
4 = other drawing tests (DAP, FD) (original categories: 9, 10)
5 = CAT-R Questionnaire (Children's Attitudes About Talking-Revised) (category 5)
6 = Millon Clinical Multiaxial Inventory-II (MCMI-II) (original category: 11)

Obs  MODV     modsd    QV       qsd
1    0.01512  0.12296  0.01600  0.12649
2    0.10616  0.32582  0.10695  0.32703
3    0.00000  0.00000  0.00000  0.00000

Correlation:

Obs  LBackZ    BackZ     UBackZ
1    0.01182   0.12012   0.22564
2    0.10605   0.28923   0.45343
3    -0.28944  -0.13268  0.03100

The possible variability among the three groups:

Obs  Q        QBetween  Pb          Qwithin  Pw
1    29.4230  11.9278   .002569834  17.4951  0.025347

Q: Variation among reliabilities of all 11 studies.
Pb: Indicates that the three groups are homogeneous (i.e., not significant, based on Q between).
Qwithin: There is still 17.4951 variability that cannot be explained by Test Type.
Pw: Q-within is not significant because Pw>.001. Indicates that the variability among the effect sizes does not contribute significantly to the Q-within variabilities.
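Under the usual large-sample variance of Fisher's z, 1/(n - 3), the variance of a fixed-effects pooled estimate is the reciprocal of the summed (n_i - 3) across the studies in a group. Assuming that weighting was used here (the SAS code is not shown in this appendix), the reciprocal of each V_T_DOT gives that sum, and adding 3K gives an implied total sample size per group. The Python sketch below is illustrative only; the dictionary layout and names are hypothetical.

```python
# Test type: (K, V_T_DOT) as reported above
groups = {1: (5, 0.003086420), 2: (3, 0.009523810), 3: (3, 0.007042254)}

def implied_weight_sum(v_t_dot):
    """Sum of (n_i - 3) implied by a pooled Fisher-z variance (assumed model)."""
    return 1 / v_t_dot

weight_sums = {g: implied_weight_sum(v) for g, (k, v) in groups.items()}

# Implied total N per group: summed weights plus 3 for each of the K studies
implied_n = {g: round(weight_sums[g]) + 3 * k for g, (k, v) in groups.items()}

# Pooling all groups should give back the overall variance
overall_v = 1 / sum(weight_sums.values())

print(weight_sums)
print(implied_n)
print(overall_v)
```

The near-integer reciprocals, and the agreement of the pooled value with the overall V_T_DOT of .001751313, are what one would expect if the weights are n - 3, though this remains an inference from the reported numbers.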


APPENDIX GG

CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS YEAR
OVERALL ANALYSIS AND DESCRIPTIVES

Concurrent Validity: Confidence Intervals for Year
[Plot: confidence interval for each study; axis from -1 to 1.5]

Row  LLIM   Fisher z  ULIM
1    -0.00  0.98      1.96
2    -0.22  0.00      0.22
3    -0.21  0.01      0.23
4    -0.10  0.12      0.34
5    -0.38  -0.16     0.06
6    0.09   0.31      0.53
7    0.38   0.69      1.01
8    -0.21  0.11      0.43
9    -0.36  -0.10     0.16
10   -0.34  0.03      0.39
11   -0.99  -0.12     0.76

Concurrent Validity: Confidence Intervals for Year, estimated (real) validity
[Plot: estimated validity with confidence interval; axis from -0.2 to 0.8]

LL_T_DOT  T_DOT  UL_T_DOT
0.01      0.09   0.17


APPENDIX HH

CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS YEAR
OVERALL META-ANALYSIS, ALL STUDIES

Obs  K   Q        P          LL_T_DOT    T_DOT     UL_T_DOT  V_T_DOT     SE_T_DOT
1    11  29.4230  .00106392  .008025498  0.090049  0.17207   .001751313  0.041849

K = No. of studies; Q represents variability among the validities; LL_T_DOT = Lower (CI); T_DOT = Fisher Z (effect size); UL_T_DOT = Upper (CI); V_T_DOT = Variance; SE_T_DOT = Standard Error.

Obs  MODV      modsd    QV        qsd
1    0.037417  0.19344  0.038640  0.19657

Correlation:

Obs  LBackZ      BackZ     UBackZ
1    .008025325  0.089806  0.17039

LBackZ = Lower bound; BackZ = Correlation; UBackZ = Upper bound.


APPENDIX II

CATEGORICAL FIXED-EFFECTS MODEL: ASSESSMENT ANALYSIS DATA VS YEAR
MODEL BY SAMPLING, ALL STUDIES

Obs  YR  K  Q        P        LL_T_DOT  T_DOT     UL_T_DOT
1    1   5  7.05288  0.13312  -0.10446  0.00443   0.11332
2    2   3  6.78312  0.03366  0.19755   0.35448   0.51140
3    3   3  0.29826  0.86146  -0.26389  -0.05843  0.14704

Obs  V_T_DOT   SE_T_DOT  MODV      modsd    QV        qsd
1    0.003086  0.05556   0.011778  0.10853  0.012464  0.11164
2    0.006410  0.08006   0.045992  0.21446  0.049586  0.22268
3    0.010989  0.10483   0.000000  0.00000  0.000000  0.00000

Correlation:

Obs  LBackZ    BackZ     UBackZ
1    -0.10408  0.00443   0.11284   The studies conducted before 1990 have very low validity
2    0.19502   0.34034   0.47104   The studies conducted between 1990-2000 are most valid
3    -0.25793  -0.05836  0.14599   The studies conducted after 2000 are less valid

The possible variability among the three groups:

Obs  Q        QBetween  Pb          Qwithin  Pw
1    29.4230  15.2887   .000478742  14.1343  0.078332

Q: Variation among reliabilities of all 11 studies.
Pb: Indicates that the three groups are heterogeneous (i.e., significant, based on Q between).
Qwithin: There is still 14.1343 variability that cannot be explained by Year.
Pw: Q-within is not significant because Pw>.001. Indicates that the variability among the effect sizes does not contribute significantly to the Q-within variabilities.
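QBetween can also be computed directly from the group-level results, as the weighted sum of squared deviations of each group's T_DOT from the overall T_DOT, with weights equal to 1/V_T_DOT. The Python sketch below applies that textbook fixed-effects formula to the Year groups. It is an illustration under that assumption rather than the original program, and the function and variable names are hypothetical.

```python
# Year group: (T_DOT, V_T_DOT) as reported above
year_groups = {
    1: (0.00443, 0.003086),
    2: (0.35448, 0.006410),
    3: (-0.05843, 0.010989),
}

def q_between(groups):
    """Weighted grand mean and between-group Q from group means and variances."""
    weights = {g: 1 / v for g, (_, v) in groups.items()}
    total_w = sum(weights.values())
    grand_mean = sum(weights[g] * t for g, (t, _) in groups.items()) / total_w
    qb = sum(weights[g] * (t - grand_mean) ** 2 for g, (t, _) in groups.items())
    return grand_mean, qb

grand_mean, qb = q_between(year_groups)
print(round(grand_mean, 4), round(qb, 2))
```

The weighted grand mean agrees closely with the overall T_DOT of 0.090049, and the resulting statistic agrees with the reported QBetween of 15.2887 up to rounding in the inputs.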


APPENDIX JJ

STUDY WEAKNESSES OF INTEREST


Author-identified study flaws Additional Study Flaws Identified by Coder

Subject-Related Flaws Lack random selection Small N Francis 2003, Gussak 2004, Overbeck 2002, Billingsley 1998, Ricca Shlagman 1996 1992 Batza 1995, Billingsley 1998, Brudenell 1989, Manning 1987, Wilson Cohen 1995, Couch 1994, Easterling 2000, 2004 Francis 2003, Gantt 1990, Gulbro 1988, Hacking 1996, Kessler 1994, Overbeck 2002, Ricca 1992, Yahnke 2000 Batza 1995, Brudenell 1989, Coffey 1997, Cohen 1995, Couch 1994, Kress 1992, Mchugh 1997, Mills 1989, Neale 1994 Brudenell 1989, Gulbro 1988, Hacking 1996, Johnson 2004, Mills 1993, Neale 1994 Brudenell 1989, Hacking 1996

Ss not matched

Clinical symptomatology too weak or varied to detect differences within subjects and/or differentiate between groups (i.e., confounding variable), thereby preventing generalization Data Collection Flaws

Researcher failed to adhere to standardized methods of Couch 1994, Easterling 2000 assessment administration Study limited in scope and/or methodology Brudenell 1989, Cohen 1995, Easterling 2000, Overbeck 2002 Assessment instrument-related flaws Assessment unable to detect specific diagnoses Problems encountered in placing assessment drawings within the graphic profile Rating instrument flaws Limited rating instrument Ricca 1992 Fowler 2002

Brudenell 1989, Gulbro 1988

Billingsley 1998, Couch 1994, Francis 2003, Hays 1981 Gantt 1990, Gulbro 1988, Gulbro 1991, Johnson 2004, Kress 1992, Mills 1989, Neale 1994, Ricca 1992 Mchugh 1997 Kress 1992 Batza 1995, Brudenell 1989 Fowler 2002, Neale 1994 Eitel 2004 Cohen 1995, Couch 1994, Kress 1992, Shlagman 1996 Cohen 1988, Gussak 2004, Kessler 1994 Coffey 1997, Hays 1981

Author's changes to rating system need improvement Inter-rater reliability flaws Only one rater used Lacking inter-rater reliability Unsatisfactory inter-rater reliability Incorrect statistical procedures used to calculate interrater reliability Inconsistencies/problems with rating procedures


APPENDIX KK

CODER-IDENTIFIED STUDY WEAKNESSES

[Chart: Weakness Category and Total Number of Weaknesses]

Subject-related weaknesses (39)
Data collection weaknesses (18)
Assessment-related weaknesses (2)
Rating instrument weaknesses (11)
Inter-rater reliability weaknesses (9)
Other (5)
No weaknesses identified (7)
Procedures/findings not reported (0)

Studies shown: Batza 1995, Bergland 1993, Billingsley 1998, Brudenell 1989, Coffey 1997, Cohen 1988, Cohen 1995, Couch 1994, Easterling 2000, Eitel 2004, Fowler 2002, Francis 2003, Gantt 1990, Gulbro 1988, Gulbro 1991, Gussak 2004, Hacking 1996, Hacking 2000, Hays 1981, Hyler 2002, Johnson 2004, Kaiser 1993, Kessler 1994, Kress 1992, Manning 1987, Mchugh 1997, Mills 1989, Mills 1993, Neale 1994, Overbeck 2002, Ricca 1992, Shlagman 1996, Wadlington 1973, Wilson 2004, Yahnke 2000

APPENDIX LL


APPENDIX MM


APPENDIX NN


REFERENCES

Acton, B. (1996). A new look at human figure drawings: Results of a meta-analysis and drawing scale development. Unpublished Dissertation, Simon Fraser University, British Columbia, Canada.
Aiken, L. R. (1997). Psychological testing and assessment (9th ed.). Boston, MA: Allyn and Bacon.
American Art Therapy Association. (2004a). About art therapy. Retrieved 1.31.2004. http://www.arttherapy.org/aboutarttherapy/about.htm.
American Art Therapy Association. (2004b). Education standards. Retrieved 9.10.2004. http://www.arttherapy.org/aboutaata/educationstandards.pdf.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Anderson, F. (2001). Needed: A major collaborative effort. Art Therapy: Journal of the American Art Therapy Association, 18(2), 74-78.
Animated Software Company, Internet Glossary of Statistical Terms. (n.d.[a]). Chi square. Retrieved February 3, 2004, from http://www.animatedsoftware.com/statglos/sgchi_sq.htm.
Animated Software Company, Internet Glossary of Statistical Terms. (n.d.[b]). t test. Retrieved February 3, 2004, from http://www.animatedsoftware.com/statglos/sgttest.htm.
Animated Software Company, Internet Glossary of Statistical Terms. (n.d.[c]). t test. Retrieved February 3, 2004, from http://www.animatedsoftware.com/statglos/sgzscore.htm.
APCA (n.d.). convergent validity. Retrieved February 28, 2005, from http://www.acpa.nche.edu/comms/comm09/dragon/dragon-recs.html.


Arrington, D. (1992). Art-based assessment procedures and instruments used in research. In H. Wadeson (Ed.), A Guide to Conducting Art Therapy Research (pp. 141-159). Mundelein, IL: The American Art Therapy Association.
Atkinson, L., Quarrington, B., Alp, I. E., & Cyr, J. J. (1986). Rorschach validity: An empirical approach to the literature. Journal of Clinical Psychology, 42(2), 360-362.
Arthur, W., Bennett, W., & Huffcutt, A. I. (2001). Conducting meta-analysis using SAS. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.
Batza Morris, M. (1995). The Diagnostic Drawing Series and the Tree Rating Scale: An isomorphic representation of multiple personality disorder, major depression, and schizophrenia populations. Art Therapy: Journal of the American Art Therapy Association, 12(2), 118-128.
Becker, B. J. (1994). Combining significance levels. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 215-230). New York: Russell Sage Foundation.
Becker, L. A. (1998). Effect size. Retrieved February 3, 2004, from Colorado University, Colorado Springs Web site: http://www.uccs.edu/~lbecker/psy590/es.htm.
Begg, C. B. (1994). Publication bias. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 399-410). New York: Russell Sage Foundation.
Bergland, C., & Moore Gonzalez, R. (1993). Art & madness: Can the interface be quantified? American Journal of Art Therapy, 31, 81-90.
Betts, D. J. (2003). Developing a projective drawing test: Experiences with the Face Stimulus Assessment (FSA). Art Therapy: Journal of the American Art Therapy Association, 20(2), 77-82.
Billingsley, G. (1998). The efficacy of the Diagnostic Drawing Series with substance-related disordered clients. Unpublished doctoral dissertation, Walden University.
Bowyer, K. A. (1995). Research using the Diagnostic Drawing Series and the PPAT with an elderly population. Unpublished master's thesis, George Washington University, Washington, DC.
Brooke, S. (1996). A therapist's guide to art therapy assessments: Tools of the trade. Springfield, IL: Charles C Thomas.
Brudenell, T. J. (1989). Art representations as functions of depressive state: Longitudinal studies in chronic childhood and adolescent depression. Unpublished master's thesis, College of Notre Dame, Belmont, CA.
Buck, J. N. (1948). The H-T-P technique, a qualitative and quantitative scoring manual. Journal of Clinical Psychology Monograph Supplement, 4, 1-120.
Burleigh, L. R., & Beutler, L. E. (1997). A critical analysis of two creative arts therapies. The Arts in Psychotherapy, 23(5), 375-381.
Buros Institute, Test Reviews Online. (n.d.[a]). Retrieved May 24, 2004, from http://buros.unl.edu/buros/jsp/category.html.
Buros Institute, Test Reviews Online. (n.d.[b]). Category list of test titles: Personality. Retrieved May 24, 2004, from http://buros.unl.edu/buros/jsp/clists.jsp?cateid=12&catename=Personality.
Burt, H. (1996). Beyond practice: A postmodern feminist perspective on art therapy research. Art Therapy: Journal of the American Art Therapy Association, 13(1), 12-19.
Bushman, B. J. (1994). Vote-counting procedures in meta-analysis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 193-214). New York: Russell Sage Foundation.
Case, C. (1998). Brief encounters: Thinking about images in assessment. INSCAPE, 3(1), 26-33.
Chapman, L. J., & Chapman, J. P. (1967). Genesis of popular but erroneous psychodiagnostic observations. Journal of Abnormal Psychology, 72(3), 193-204.
Coffey, T. M. (1997). The use of the Diagnostic Drawing Series with an adolescent population. Unpublished paper.
Cohen, B. M. (Ed.). (1985). The Diagnostic Drawing Series Handbook. Unpublished handbook.
Cohen, B. M. (Ed.). (1986/1994). The Diagnostic Drawing Series Rating Guide. Unpublished guidebook.
Cohen, B. M., Hammer, J., & Singer, S. (1988). The Diagnostic Drawing Series (DDS): A systematic approach to art therapy evaluation and research. Arts in Psychotherapy, 15(1), 11-21.
Cohen, B. M., & Heijtmajer, O. (1995). Identification of dissociative disorders: Comparing the SCID-D and dissociative experiences scale with the Diagnostic Drawing Series. Unpublished paper.
Cohen, B., & Mills, A. (2000). Report on the Diagnostic Drawing Series. Unpublished paper. Alexandria, VA: The DDS Project.
Cohen, B. M., Mills, A., & Kijak, A. K. (1994). An introduction to the Diagnostic Drawing Series: A standardized tool for diagnostic and clinical use. Art Therapy: Journal of the American Art Therapy Association, 11(2), 105-110.


Cohen, F. W., & Phelps, R. E. (1985). Incest markers in children's artwork. Arts in Psychotherapy, 12(4), 265-283.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Colaizzi, P. F. (1978). Psychological research as the phenomenologist views it. In R. S. Vale & M. King (Eds.), Existential-phenomenological alternatives for psychology (pp. 48-71). New York: Oxford University Press.
Conard, F. (1992). The arts in education and a meta-analysis. Unpublished Dissertation, Purdue University.
Cooper, H. (1998). Synthesizing research: A guide for literature reviewers (3rd ed.). Thousand Oaks, CA: Sage Publications.
Cooper, H., & Ribble, R. G. (1989). Influences on the outcome of literature searches for integrative research reviews. Knowledge: Creation, Diffusion, Utilization, 10, 179-201.
Couch, J. B. (1994). Diagnostic Drawing Series: Research with older people diagnosed with organic mental syndromes and disorders. Art Therapy: Journal of the American Art Therapy Association, 11(2), 111-115.
Cox, C. T. (Moderator), Agell, G., Cohen, B., & Gantt, L. (1998, November). Are you assessing what I'm assessing? Let's take a look! Panel presented at the meeting of the American Art Therapy Association, Portland, OR.
Cox, C. T. (Moderator), Agell, G., Cohen, B., & Gantt, L. (1999, November). Are you assessing what I'm assessing? Let's take a look! Round two. Panel presented at the meeting of the American Art Therapy Association, Orlando, FL.
Cuadra, C. A., & Katter, R. V. (1967). Opening the black box of relevance. Journal of Documentation, 23, 291-303.
Davidson, D. (1977). The effects of individual differences of cognitive style on judgments of document relevance. Journal of the American Society for Information Science, 8, 273-284.
Dawes, R. M. (1999). Two methods for studying the incremental validity of a Rorschach variable. Psychological Assessment, 11(3), 297-302.
Dawson, C. F. S. (1984). A study of selected style and content variables in the drawings of depressed and nondepressed adults. Unpublished dissertation, University of North Dakota, Grand Forks, ND.
Deaver, S. P. (2002). What constitutes art therapy research? Art Therapy: Journal of the American Art Therapy Association, 19(1), 23-27.
Easterling, C. E. (2000). Art therapy with elementary school students with symptoms of depression: The effects of storytelling, fantasy, daydreams, and art reproductions. Unpublished master's thesis, Florida State University, Tallahassee, FL.
Eitel, K., Szkura, L., & Wietersheim, J. v. (2004). Do you see what I see? A study about the interrater-reliability in art therapy. Unpublished paper.
Elkins, D. E., Stovall, K., & Malchiodi, C. A. (2003). American Art Therapy Association, Inc.: 2001-2002 membership survey report. Art Therapy: Journal of the American Art Therapy Association, 20(1), 28-34.
Evidence Based Emergency Medicine, New York Academy of Medicine. (n.d.). Odds ratio. Retrieved February 3, 2004, from http://www.ebem.org/definitions.html#sectO.
Feder, B., & Feder, E. (1998). The art and science of evaluation in the arts therapies: How do you know what's working? Springfield, IL: Charles C Thomas.
Fenner, P. (1996). Heuristic research study: Self-therapy using the brief image-making experience. Arts in Psychotherapy, 23(1), 37-51.
Fowler, J. P., & Ardon, A. M. (2002). Diagnostic Drawing Series and dissociative disorders: A Dutch study. Arts in Psychotherapy, 29(4), 221-230.
Fraenkel, J. R., & Wallen, N. E. (2003). How to design and evaluate research in education (5th ed.). New York, NY: McGraw-Hill.
Francis, D., Kaiser, D., & Deaver, S. P. (2003). Representations of attachment security in the bird's nest drawings of clients with substance abuse disorders. Art Therapy: Journal of the American Art Therapy Association, 20(3), 125-137.
Furth, G. (1988). The secret world of drawings: Healing through art. Boston, MA: Sigo Press.
Gantt, L. (1986). Systematic investigation of art works: Some research models drawn from neighboring fields. American Journal of Art Therapy, 24(4), 111-118.
Gantt, L. (1990). A validity study of the Formal Elements Art Therapy Scale (FEATS) for diagnostic information in patients' drawings. Unpublished doctoral dissertation, University of Pittsburgh, Pittsburgh, PA.
Gantt, L. (1992). A description and history of art therapy assessment research. In H. Wadeson (Ed.), A Guide to Conducting Art Therapy Research (pp. 119-139). Mundelein, IL: The American Art Therapy Association.
Gantt, L. (2004). The case for formal art therapy assessments. Art Therapy: Journal of the American Art Therapy Association, 21(1), 18-29.
Gantt, L., & Tabone, C. (1998). The Formal Elements Art Therapy Scale: The Rating Manual. Morgantown, WV: Gargoyle Press.
Gantt, L., & Tabone, C. (2001, November). Measuring clinical changes using art. Paper presented at the meeting of the American Art Therapy Association, Albuquerque, NM.
Garb, H. N. (2000). Projective techniques and the detection of child sexual abuse. Child Maltreatment: Journal of the American Professional Society on the Abuse of Children, 5(2), 161-168.
Garb, H. N., Florio, C. M., & Grove, W. M. (1998). The validity of the Rorschach and the Minnesota Multiphasic Personality Inventory: Results from meta-analyses. Psychological Science, 9(5), 402-404.
Garb, H. N., Wood, J. M., Nezworski, M. T., Grove, W. M., & Stejskal, W. J. (2001). Toward a resolution of the Rorschach controversy. Psychological Assessment, 13(4), 423-448.
Glass, G., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage Publications.
Golomb, C. (1992). The Child's Creation of a Pictorial World. Berkeley: University of California Press.
Google (n.d.). Criterion validity. Retrieved March 6, 2005, from http://www.google.com/search?hl=en&lr=&ie=UTF8&oi=defmore&q=define:Criterion+Validity.
Groth-Marnat, G. (1990). Handbook of psychological assessment (2nd ed.). New York, NY: Wiley.
Gulbro Leavitt, C. (1988). A validity study of the Diagnostic Drawing Series as used for assessing depression in children and adolescents. Unpublished doctoral dissertation, California School of Professional Psychology, Los Angeles, CA.
Gulbro-Leavitt, C., & Schimmel, B. (1991). Assessing depression in children and adolescents using the Diagnostic Drawing Series modified for children (DDS-C). Arts in Psychotherapy, 18(4), 353-356.
Gulliver, P. (circa 1970s). Art therapy in an assessment center. INSCAPE, 12, pp. unknown.
Gussak, D. (2004). Art therapy with prison inmates: A pilot study. Arts in Psychotherapy, 31(4), 245-259.
Hacking, S. (1999). The psychopathology of everyday art: A quantitative study. Dissertation, University of Keele, Sheffield, UK. Published online at http://www.musictherapyworld.de/modules/archive/stuff/papers/Hacking.pdf
Hacking, S. (2001). Psychopathology in paintings: A meta-analysis of studies using paintings by psychiatric patients. British Journal of Medical Psychology, 74(Pt1), 35-45.
Hacking, S., Foreman, D., & Belcher, J. (1996). The Descriptive Assessment for Psychiatric Art: A new way of quantifying paintings by psychiatric patients. Journal of Nervous and Mental Disease, 184(7), 425-430.
Hacking, S., & Foreman, D. (2000). The Descriptive Assessment for Psychiatric Art (DAPA): Update and further research. Journal of Nervous and Mental Disease, 188(8), 525-529.
Hadley, R. G., & Mitchell, L. K. (1995). Counseling research and program evaluation. Pacific Grove, CA: Brooks/Cole Publishing Company.
Hagood, M. M. (1990). Art therapy research in England: Impressions of an American art therapist. The Arts in Psychotherapy, 17, 75-79.
Hagood, M. M. (1992). Diagnosis or dilemma: Drawings of sexually abused children. British Journal of Projective Psychology, 37(1), 22-33.
Hagood, M. M. (2002). A correlational study of art-based measures of cognitive development: Clinical and research implications for art therapists working with children. Art Therapy: Journal of the American Art Therapy Association, 19(2), 63-68.
Hagood, M. M. (2004). Commentary. Art Therapy: Journal of the American Art Therapy Association, 21(1), 3.
Hall, J. A., Tickle-Degnen, L. T., Rosenthal, R., & Mosteller, F. (1994). Hypotheses and problems in research synthesis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 17-28). New York: Russell Sage Foundation.
Hammer, E. F. (1958). The clinical application of projective drawings. Springfield, IL: Charles C Thomas.
Hardiman, G. W., Liu, F. J., & Zernich, T. (1992). Assessing knowledge in the visual arts. In G. Cupchik & J. Laszlo (Eds.), Emerging visions of the aesthetic process: Psychology, semiology, and philosophy (pp. 171-182). New York, NY: Cambridge University Press.
Harris, D. B. (1963). Children's drawings as measures of intellectual maturity. New York, NY: Harcourt, Brace, & World.
Harris, J. B. (1996). Children's drawings as psychological assessment tools. Retrieved April 19, 2003, from http://www.iste.org/jrte/28/5/harris/article/introduction.cfm.


Harris, D. B., & Roberts, J. (1972). Intellectual maturity of children: Demographic and socioeconomic factors. Vital & Health Statistics, Series 2 (pp. 1-74).
Hays, R. E., & Lyons, S. J. (1981). The bridge drawing: A projective technique for assessment in art therapy. Arts in Psychotherapy, 8(3-4), 207-217.
Hedges, L. V. (1994). Statistical considerations. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 29-38). New York: Russell Sage Foundation.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Hiller, J. B., Rosenthal, R., Bornstein, R. F., Berry, D. T. R., & Brunell-Neuleib, S. (1999). A comparative meta-analysis of Rorschach and MMPI validity. Psychological Assessment, 11(3), 278-296.
Horovitz, E. G. (Moderator), Agell, G., Gantt, L., Jones, D., & Wadeson, H. (2001, November). Upholding beliefs: Art therapy assessment, training, and practice. Panel presented at the meeting of the American Art Therapy Association, Albuquerque, NM.
Hrdlicka, A. (1899). Art and literature in the mentally abnormal. American Journal of Insanity, 55, 385-404.
Hunsley, J., & Bailey, J. M. (1999). The clinical utility of the Rorschach: Unfulfilled promises and an uncertain future. Psychological Assessment, 11(3), 266-277.
Hyler, C. (2002). Children's drawings as representations of attachment. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
Johnson, K. M. (2004). The use of the Diagnostic Drawing Series in the diagnosis of bipolar disorder. Unpublished Dissertation, Seattle Pacific University, Seattle, WA.
Julliard, K. N., & Van Den Heuvel, G. (1999). Susanne K. Langer and the foundations of art therapy. Art Therapy: Journal of the American Art Therapy Association, 16(3), 112-120.
Kahill, S. (1984). Human figure drawing in adults: An update of the empirical evidence, 1967-1982. Canadian Psychology, 25(4), 269-292.
Kaiser, D. (1993). Attachment organization as manifested in a drawing task. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
Kaplan, F. F. (1991). Drawing assessment and artistic skill. Arts in Psychotherapy, 18, 347-352.
Kaplan, F. (2001). Areas of inquiry for art therapy research. Art Therapy: Journal of the American Art Therapy Association, 18(3), 142-147.


Kaplan, F. (2003). Art-based assessments. In C. A. Malchiodi (Ed.), Handbook of Art Therapy (pp. 25-35). New York, NY: The Guilford Press.
Kay, S. R. (1978). Qualitative differences in human figure drawings according to schizophrenic subtype. Perceptual and Motor Skills, 47, 923-932.
Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). New York, NY: Holt, Rinehart & Winston.
Kessler, K. (1994). A study of the Diagnostic Drawing Series with eating disordered patients. Art Therapy: Journal of the American Art Therapy Association, 11(2), 116-118.
Kinget, G. M. (1958). The Drawing Completion Test. In E. F. Hammer (Ed.), The clinical application of projective drawings (pp. 344-364). Springfield, IL: Charles C Thomas.
Kirk, A., & Kertesz, A. (1989). Hemispheric contributions to drawing. Neuropsychologia, 27(6), 881-886.
Klepsch, M., & Logie, L. (1982). Children draw and tell: An introduction to the projective uses of children's human figure drawings. New York, NY: Brunner/Mazel.
Klopfer, W. G., & Taulbee, E. S. (1976). Projective tests. Annual Review of Psychology, 27, 543-567.
Knapp, N. M. (1994). Research with diagnostic drawings for normal and Alzheimer's subjects. Art Therapy, 11(2), 131-138.
Kress, T., & Mills, A. (1992). Multiple personality disorder and the Diagnostic Drawing Series: Further investigations. Unpublished paper.
Kwiatkowska, H. Y. (1975). Family art therapy: Experiments with a new technique. In E. Ulman & P. Dachinger (Eds.), Art therapy in theory and practice (pp. 113-125). New York, NY: Schocken Books.
Kwiatkowska, H. Y. (1978). Family therapy and evaluation through art. Springfield, IL: Charles C Thomas.
Langevin, R., & Hutchins, L. M. (1973). An experimental investigation of judges' ratings of schizophrenics' and non-schizophrenics' paintings. Journal of Personality Assessment, 37(6), 537-543.
Langevin, R., Raine, M., Day, D., & Waxer, K. (1975a). Art experience, intelligence and formal features in psychotics' paintings. Arts in Psychotherapy (study 1), 2(2), 149-158.
Langevin, R., Raine, M., Day, D., & Waxer, K. (1975b). Art experience, intelligence and formal features in psychotics' paintings. Arts in Psychotherapy (study 2), 2(2), 149-158.

Lehmann, H., & Risquez, F. (1953). The use of finger paintings in the clinical evaluation of psychotic conditions: A quantitative and qualitative approach. Journal of Mental Science, 99, 763-777.
Levick, M. F. (2001). The Levick Emotional and Cognitive Art Therapy Assessment (LECATA). Boca Raton, FL: The South Florida Art Psychotherapy Institute.
Loewy, J. V. (1995). A hermeneutic panel study of music therapy assessment with an emotionally disturbed boy. Unpublished Dissertation, New York University, New York.
Lombroso, C. (1891). The man of genius. London: Walter Scott.
Lowenfeld, V. (1939). The nature of creative activity. New York, NY: Harcourt, Brace.
Lowenfeld, V. (1947). Creative and mental growth. New York, NY: Macmillan.
Luzzatto, P. (1987). The internal world of drug-abusers: Projective pictures of self-object relationships: A pilot study. British Journal of Projective Psychology, 32(2), 22-33.
MacGregor, J. M. (1989). The discovery of the art of the insane. Princeton, NJ: Princeton University Press.
Machover, K. (1949). Personality projection in the drawing of the human figure. Oxford, England: Charles C Thomas.
Maclagan, D. (1989). The aesthetic dimension of art therapy: Luxury or necessity. INSCAPE, Spring, 10-13.
Malchiodi, C. (1994). Introduction to special section of art-based assessments. Art Therapy: Journal of the American Art Therapy Association, 11(2), 104.
Manning, T. M. (1987). Aggression depicted in abused children's drawings. The Arts in Psychotherapy, 14, 15-24.
McGlashan, T. H., Wadeson, H. S., Carpenter, W. T., & Levy, S. T. (1977). Art and recovery style from psychosis. Journal of Nervous and Mental Disease, 164(3), 182-190.
McHugh, C. M. (1997). A comparative study of structural aspects of drawings between individuals diagnosed with major depressive disorder and bipolar disorder in the manic phase. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
McNiff, S. (1998). Art-based research. Philadelphia, PA: Jessica Kingsley.
Meyer, G. J., & Archer, R. P. (2001). The hard science of Rorschach research: What do we know and where do we go? Psychological Assessment, 13(4), 486-502.


Miljkovitch, M., & Irvine, G. M. (1982). Comparison of drawing performances of schizophrenics, other psychiatric patients and normal schoolchildren on a draw-a-village task. Arts in Psychotherapy, 9, 203-216.
Mills, A. (1989). A statistical study of the formal aspects of the Diagnostic Drawing Series of borderline personality disordered patients, and its context in contemporary art therapy. Unpublished master's thesis, Concordia University, Montreal, PQ.
Mills, A., & Cohen, B. (1993). Facilitating the identification of multiple personality disorder through art: The Diagnostic Drawing Series. In E. S. Kluft (Ed.), Expressive and functional therapies in the treatment of multiple personality disorder. Springfield, IL: Charles C Thomas.
Mills, A., Cohen, B. M., & Meneses, J. Z. (1993a). Reliability and validity tests of the Diagnostic Drawing Series. Arts in Psychotherapy, 20, 83-88.
Mills, A., Cohen, B. M., & Meneses, J. Z. (1993b). Reliability and validity tests of the Diagnostic Drawing Series: DDS study 77 naive raters, unpublished report. Arts in Psychotherapy, 20, 83-88.
Mills, A., & Goodwin, R. (1991). An informal survey of assessment use in child art therapy. Art Therapy: Journal of the American Art Therapy Association, 8(2), 10-13.
Murray, H. A. (1943). Thematic apperception test. Cambridge, MA: Harvard University Press.
National Art Education Association (1994). Advisory. Computers and art education. Author.
Neale, E. L. (1994). The Children's Diagnostic Drawing Series. Art Therapy: Journal of the American Art Therapy Association, 11(2), 119-126.
Neale, E. L., & Rosal, M. L. (1993). What can art therapists learn from the research on projective drawing techniques for children? A review of the literature. The Arts in Psychotherapy, 20, 37-49.
Nunnally, J. C. (1960). The place of statistics in psychology. Educational and Psychological Measurement, 20, 641-650.
Orr, P. P. (2003). A hollow God: Technology's effects on paradigms and practices in secondary art education. Unpublished Dissertation, Purdue University, West Lafayette, IN.
Orwin, R. G. (1994). Evaluating coding decisions. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 139-162). New York: Russell Sage Foundation.
Oster, G. D., & Gould Crone, P. (2004). Using drawings in assessment and therapy: A guide for mental health professionals. New York, NY: Brunner-Routledge.


Oswald, F. L. (2003). Meta-analysis and the art of the average. Validity generalization: A critical review, 311-338.
Overbeck, L. B. (2002). A pilot study of pregnant women's drawings. Unpublished master's thesis, Eastern Virginia Medical School, Norfolk, VA.
Parker, K. C. H., Hanson, R. K., & Hunsley, J. (1988). MMPI, Rorschach, and WAIS: A meta-analytic comparison of reliability, stability, and validity. Psychological Bulletin, 103(3), 367-373.
Phillips, J. (1994). Commentary on the assessment portion of the art therapy practice analysis survey. Art Therapy: Journal of the American Art Therapy Association, 11(3), 151-152.
Pigott, T. D. (1994). Methods for handling missing data in research synthesis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 163-175). New York: Russell Sage Foundation.
Price, D. (1965). Networks of scientific papers. Science, 149, 510-515.
Quail, J. M., & Peavy, R. V. (1994). A phenomenologic research study of a client's experience in art therapy. Arts in Psychotherapy, 21(1), 45-57.
Rankin, A. (1994). Tree drawings and trauma indicators: A comparison of past research with current findings from the DDS. Art Therapy: Journal of the American Art Therapy Association, 11(2), 127-130.
Ricca, D. (1992). Utilizing the Diagnostic Drawing Series as a tool in differentiating a diagnosis between multiple personality disorder and schizophrenia. Unpublished master's thesis, Hahnemann University, Philadelphia, PA.
Ritter, M., & Low, K. G. (1996). Effects of dance/movement therapy: A meta-analysis. Arts in Psychotherapy, 23(3), 249-260.
Roback, H. B. (1968). Human figure drawings: Their utility in the clinical psychologist's armamentarium for personality assessment. Psychological Bulletin, 70(1), 1-19.
Rosal, M. L. (1992). Illustrations of art therapy research. In H. Wadeson (Ed.), A guide to conducting art therapy research (pp. 57-65). Mundelein, IL: The American Art Therapy Association.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.
Rosenthal, R. (1991). Meta-analytic procedures for social research (rev. ed.). Newbury Park, CA: Sage.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 231-244). New York: Russell Sage Foundation.
Rosenthal, R., Hiller, J. B., Bornstein, R. F., Berry, D. T. R., & Brunell-Neuleib, S. (2001). Meta-analytic methods, the Rorschach, and the MMPI. Psychological Assessment, 13(4), 449-451.
Rosnow, R. L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers. Psychological Methods, 1, 331-340.
Rubin, J. A. (1987). Approaches to art therapy: Theory and technique. New York, NY: Brunner/Mazel.
Rubin, J. A. (1999). Art therapy: An introduction. Philadelphia, PA: Brunner/Mazel.
Russell-Lacy, S., Robinson, V., Benson, J., & Cranage, J. (1979). An experimental study of pictures produced by acute schizophrenic subjects. British Journal of Psychiatry, 134, 195-200.
Shlagman, H. (1996). The Diagnostic Drawing Series: A comparison of psychiatric inpatient adolescents in crisis with non-hospitalized youth. Unpublished master's thesis, College of Notre Dame, Belmont, CA.
Scope, E. E. (1999). A meta-analysis of research on creativity: The effects of instructional variables. Unpublished dissertation.
Sharples, G. D. (1992). Intrinsic motivation and social constraints: A meta-analysis of experimental research utilizing creative activities. Unpublished dissertation.
Shoemaker-Beal, R. (1977). The significance of the first picture in art therapy. Paper presented at the Dynamics of Creativity: The Eighth Annual Conference of the American Art Therapy Association, Baltimore, MD.
Silver, R. (1966). The role of art in the conceptual thinking, adjustment, and aptitudes of deaf and aphasic children. Unpublished doctoral dissertation, Columbia University, New York.
Silver, R. A. (1983). Silver Drawing Test of Cognitive and Creative Skills. Seattle, WA: Special Child Publications.
Silver, R. A. (1988, 1993). Draw A Story, Screening for Depression and Emotional Needs. New York, NY: Ablin Press.


Silver, R. A. (1990). Silver Drawing Test of Cognitive Skills and Adjustment. Drawing What You Predict, What You See, and What You Imagine. New York, NY: Ablin Press.
Silver, R. A. (1996). Silver Drawing Test of Cognition and Emotion (3rd ed.). New York, NY: Ablin Press.
Silver, R. A. (2002). Three art assessments: The Silver Drawing Test of cognition and emotion; draw a story: Screening for depression; and stimulus drawings and techniques. New York, NY: Brunner-Routledge.
Silver, R. A. (2003). The Silver Drawing Test of Cognition and Emotion. In C. A. Malchiodi (Ed.), Handbook of art therapy (pp. 410-419). New York, NY: The Guilford Press.
Silver, R., & Ellison, J. (1995a). Identifying and assessing self-images in drawings by delinquent adolescents. Arts in Psychotherapy, 22(4), 339-352.
Silver, R., & Ellison, J. (1995b). Identifying and assessing self-images in drawings by delinquent adolescents: Part 2. Arts in Psychotherapy, 22(4), 339-352.
Simon, P. M. (1888). Les écrits et les dessins des aliénés. Archivio di Antropologia Criminelle, Psichiatria e Medicina Legale, 3, 318-355.
Sims, J., Bolton, B., & Dana, R. H. (1983). Dimensionality & concurrent validity of the Handler DAP anxiety index. Multivariate Experimental Clinical Research, 6(2), 69-79.
Spangler, W. D. (1992). Validity of questionnaire and TAT measures of need for achievement: Two meta-analyses. Psychological Bulletin, 112(1), 140-154.
Srivastava, A. K. (2002). Somatic Inkblot Series-I: A meta analysis. Journal of Projective Psychology & Mental Health, 9(1), 33-37.
Stock, W. A. (1994). Systematic coding for research synthesis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 125-138). New York: Russell Sage Foundation.
Stricker, G., & Gold, J. R. (1999). The Rorschach: Toward a nomothetically based, idiographically applicable configurational model. Psychological Assessment, 11(3), 240-250.
Suinn, R. M., & Oskamp, S. (1969). The predictive validity of projective measures: A fifteen-year evaluative review of research. Springfield, IL: Charles C Thomas.
Swensen, C. H. (1968). Empirical evaluations of human figure drawings: 1957-1966. Psychological Bulletin, 70(1), 20-44.
Tardieu, A. (1886). Etudes médico-légales sur la folie. Paris: JB Baillière.


Tate, R. (1998). An introduction to modeling outcomes in the behavioral and social sciences (2nd ed.). Edina, MN: Burgess Publishing.
Taveggia, T. C. (1974). Resolving research controversy through empirical cumulation. Sociological Methods and Research, 2, 395-407.
Teneycke, T. (1988). Eating disorders and affective disorders: Is there a connection? A study involving the Diagnostic Drawing Series. Unpublished bachelor's thesis, University of Regina, Regina, Sask.
Uhlin, D. M. (1978). Assessment of violent prone personality through art. British Journal of Projective Psychology & Personality Study, 23(1), 15-22.
Ulman, E. (1965). A new use of art in psychiatric diagnosis. Bulletin of Art Therapy, 4, 91-116.
Ulman, E. (1975). A new use of art in psychiatric diagnosis. In E. Ulman & P. Dachinger (Eds.), Art therapy in theory and practice (pp. 361-386). New York: Schocken.
Ulman, E. (1992). A new use of art in psychiatric diagnosis. American Journal of Art Therapy, 30, 78-88.
Ulman, E., & Levy, B. I. (1975). An experimental approach to the judgment of psychopathology from paintings. In E. Ulman & P. Dachinger (Eds.), Art therapy in theory and practice (pp. 393-402). New York: Schocken.
Ulman, E., & Levy, B. I. (1992). An experimental approach to the judgment of psychopathology from paintings. American Journal of Art Therapy, 30, 107-112.
Viglione, D. J. (1999). A review of recent research addressing the utility of the Rorschach. Psychological Assessment, 11(3), 251-265.
Wadeson, H. (2002). The anti-assessment devil's advocate. Art Therapy: Journal of the American Art Therapy Association, 19(4), 168-170.
Wadeson, H. (2003). About this issue. Art Therapy: Journal of the American Art Therapy Association, 20(2), 63.
Wadeson, H. (2004). Commentary. Art Therapy: Journal of the American Art Therapy Association, 21(1), 3-4.
Wadeson, H., & Carpenter, W. (1976). A comparative study of art expression of schizophrenic, unipolar depressive, and bipolar manic-depressive patients. Journal of Nervous and Mental Disease, 162(5), 334-344.
Wadlington, W. L., & McWhinnie, H. J. (1973). The development of a rating scale for the study of formal aesthetic qualities in the paintings of mental patients. Arts in Psychotherapy, 1(3-4), 201-220.
Walsh, B., & Betz, N. (2001). Tests and assessment (4th ed.). Upper Saddle River, NJ: Prentice Hall.
West, M. M. (1998). Meta-analysis of studies assessing the efficacy of projective techniques in discriminating child sexual abuse. Child Abuse & Neglect, 22(11), 1151-1166.
Wiersma, W. (2000). Research methods in education: An introduction (7th ed.). Boston, MA: Allyn and Bacon.
Wilson, K. (2004). Projective drawing: Alternative assessment of emotion in children who stutter. Unpublished bachelor's thesis, Florida State University, Tallahassee, FL.
Wolf, F. M. (1986). Meta-analysis: Quantitative methods for research synthesis. Beverly Hills, CA: Sage.
Wright, J. H., & Macintyre, M. P. (1982). The family drawing depression scale. Journal of Clinical Psychology, 38(4), 853-861.
Yahnke, L. (2000). Diagnostic Drawing Series as an assessment for children who have witnessed marital violence. Unpublished doctoral dissertation, Minnesota School of Professional Psychology, Minneapolis, MN.
Yazdani, S. (2002a). Reliability. Retrieved June 2, 2004, from http://216.239.39.104/search?q=cache:HAOfbC5GwqwJ:www.atgci.org/medical%2520education/reliability.ppt+inter+rater+reliability+definition&hl=en.
Yazdani, S. (2002b). Validity. Retrieved June 2, 2004, from http://216.239.51.104/search?q=cache:TyjgaTVEl7oJ:www.atgci.org/medical%2520education/validity.ppt+validity+definition+Yazdani&hl=en&ie=UTF-8.


BIOGRAPHICAL SKETCH

Donna J. Betts, ATR-BC, hails from Toronto, Canada. She received a Bachelor of Fine Arts from the Nova Scotia College of Art & Design in 1992, a Master of Arts in Art Therapy from the George Washington University in 1999, and a PhD in Art Education with a specialization in Art Therapy from the Florida State University in 2005. From 2002 to 2005, she worked as a teaching assistant while completing her doctoral studies. In 2004, Ms. Betts was the proud recipient of the Daisy Parker Flory Graduate Scholar Award, bestowed upon her by the Honor Society of Phi Kappa Phi at Florida State. A registered and board-certified art therapist, Ms. Betts began working with people who have eating disorders in Tallahassee, Florida, in 2003, and as an information assistant, writer, and graphic designer for Florida State University Communications in 2004. From 1998 to 2002, Ms. Betts worked as an art therapist with children and adolescents with multiple disabilities in Washington, DC. The book Creative Arts Therapies Approaches in Adoption and Foster Care: Contemporary Strategies for Working With Individuals and Families was conceived and edited by Ms. Betts and published by Charles C Thomas in 2003. In addition to this 16-chapter volume, Ms. Betts has published articles and has presented at various conferences and schools. Ms. Betts served on the Board of Directors of the American Art Therapy Association (AATA) from 2002 to 2004, and founded the Research Committee and Governmental Affairs Committee websites for the AATA (www.arttherapy.org) in 2001. She has served as the Recording Secretary for the National Coalition of Creative Arts Therapies Associations (NCCATA) since 1999, and as their web administrator since 2001 (www.nccata.org). In 2000, Ms. Betts received the Chapter Distinguished Service Award from the Potomac Art Therapy Association in Washington, DC. She promotes art therapy on her website, www.art-therapy.us.
