
Psychological Assessment © 2010 American Psychological Association

2010, Vol. 22, No. 1, 1–4  1040-3590/10/$12.00  DOI: 10.1037/a0018811

Measurement and Assessment: An Editorial View

If a thing exists, it can be measured—not always well, at least at first, but it can be measured. Measurement
is a central component of assessment if we believe that fear, anxiety, intelligence, self-esteem, attention, and
similar latent variables exist and are useful to us in developing an understanding of the human condition and
leading us to ways to improve it. Clinical assessment, to which this journal is devoted, goes beyond
measurement to deal with the development of a broader understanding of individuals and phenomena than
measurement alone can provide. Yet without accurate, valid measurements, we are seriously handicapped in
our clinical endeavors.
The ability and skill to measure variables accurately is a cornerstone of progress in science as well; one need only read in the history of physics, medicine, or biology to discern this fundamental truth. It is also a cornerstone of excellence in professional practice in psychology as well as in research. Good measurement practices precede and presage good assessment.
Much of what is published in Psychological Assessment deals with the development and the application of
measurement devices of various sorts with the end goal of applications in assessment practice. What is
submitted but not published largely deals with the same topics.
As the new Editor writing the inaugural editorial, I am focusing on this topic for two major reasons. The
first is that the most frequent reason why manuscripts are rejected in the peer-review process for Psychological
Assessment and other high-quality journals devoted to clinical or neuropsychological assessment is inadequate
attention to sound and high-quality measurement practices. The second reason is my surmise that measurement as a science is no longer taught with the rigor that characterized the earlier years of professional
psychology, a position I have reviewed and argued in more detail elsewhere (Reynolds, 2008).
It seems that “with the growth of knowledge in areas such as cognitive development, learning, psychopathology, and the biological bases of behavior (e.g., genetics, neuropsychology, and physiological psychology),
the teaching of measurement in professional psychology programs is competing more and more with other
content domains—and, losing” (Reynolds, 2008, p. 3). This shortchanging of measurement instruction is
reflected in numerous ways in clinical practice as well as in research.
I have much opportunity to be in contact with a variety of psychologists who practice clinical assessment
and engage in research in many disciplines within psychology. As a journal editor for nearly 17 years (over
three journals dealing with various aspects of clinical and neuropsychological assessment), an associate editor
of several journals, and a member of 15 or more editorial boards, I have read a lot of manuscripts that never
see the light of refereed publication.
Frequently, we see manuscript submissions that feature poor measuring devices, even though those
instruments and the constructs they presumably measure were developed and researched at considerable
expense and effort. Despite the time and money, the researchers often give no consideration to principles of
item writing, apply no empirical item-selection methods, and use inappropriate reliability estimates of the
scores (almost invariably referred to as the reliability of the test); sometimes reliability estimates are
conspicuously missing. Seldom is a table of specifications developed to ensure the representativeness of the
items to the levels and dimensions of the construct to be measured. The issues associated with domain
sampling are largely ignored, as are the nuances of item wording and item format, which clearly can affect
the response process and what is actually being measured. Even the issue of clear standardized instructions is
ignored in some cases. All too often, a single researcher or a small team will simply write a set of items and
begin research on a construct with no apparent awareness that the key first step has been omitted: namely,
conducting the necessary research on the measuring device itself, modifying it appropriately, trying out the
items again—and again if necessary—and then applying the instrument to the research on the construct in
question and making it available to other researchers.
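Because the phrase "reliability estimates of the scores" trips up so many authors, a brief illustration may help. The following is a minimal sketch, not drawn from any submitted manuscript, of estimating one common index of internal consistency (Cronbach's coefficient alpha) for a particular set of item responses; the data are simulated and every name in the snippet is hypothetical.

```python
import numpy as np

def coefficient_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = item_scores.shape[1]                              # number of items
    item_variances = item_scores.var(axis=0, ddof=1)      # variance of each item's scores
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of the total scores
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Simulated data: 200 respondents, 10 items that all tap one latent trait plus noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
responses = trait + rng.normal(scale=1.0, size=(200, 10))

print(f"alpha for these scores: {coefficient_alpha(responses):.2f}")
```

The point of the exercise is that the estimate describes these scores, from this sample, under these conditions; a different sample or administration yields a different estimate, which is precisely why "the reliability of the test" is a misnomer.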
Another measurement issue seen often in papers that are not accepted for publication concerns sampling of
the population and the ability of a research sample to address the research questions. One of the most frequent
issues encountered with submissions to Psychological Assessment in this regard is the use of college students
(an obvious sample of convenience) to answer questions about the structure of a test with clinical or referred
samples or to test hypotheses about clinical applications of a test in the population at large. Collegiate samples,
especially from a single university or campus, are rarely representative of anything beyond students on that
campus. Unless the population of interest is very narrowly defined, collegiate samples are not representative
of clinical populations, making it inappropriate to generalize beyond the sample (i.e., such studies lack external validity).
Using such narrow samples that are not characteristic of the population at large most often results in an underestimate of the variance or heterogeneity of that population on individual item scores as well as on any component scores. The net result is considerable distortion in the derivation and interpretation of many measurement statistics.
The use of college students can be justified in some studies where the goal is to provide findings that
generalize not to clinical populations but rather to other populations adequately sampled by using college
students. However, the appropriateness of such sampling must be demonstrated on the basis of empirical work
and not presumption, and even then with the understanding that findings need to be replicated with samples of
other populations, unless the work is intended to be applied to the college student population only, which is
a worthy population of study itself.
Sample size is another recurring issue in the rejection of submissions to Psychological Assessment.
Measurement and psychological research in general have turned to many powerful modeling methods that are
essentially regression based and that require strong subject-to-variable ratios. Too many articles use small
samples that are overwhelmed by the methods of analysis, especially when many variables are involved, as
is common when item-level responding is analyzed.
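To make the subject-to-variable problem concrete, here is a minimal sketch under an assumed rule of thumb (10 participants per variable, one commonly cited heuristic used here purely as an assumption; nothing in this editorial endorses a specific number) for checking whether a planned item-level analysis is likely to overwhelm the sample.

```python
def meets_ratio(n_participants: int, n_variables: int, min_ratio: float = 10.0) -> bool:
    """Check a planned analysis against an assumed subject-to-variable rule of thumb."""
    return n_participants / n_variables >= min_ratio

# Hypothetical example: factor-analyzing item-level responses to a 40-item scale.
n_items = 40
for n in (120, 400, 800):
    status = "adequate" if meets_ratio(n, n_items) else "likely too small"
    print(f"N = {n:4d} with {n_items} items -> ratio {n / n_items:4.1f}:1 ({status})")
```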
Ignoring or shortchanging the development process for measures used in research adds noise, not clarity,
to the science of our profession. For the clinician, it can create overconfidence in the application of what
rightly should be a research scale and can cloud the accuracy of the outcome of an otherwise sound clinical
assessment.
The idea of using normative data as a reference sample is often misunderstood by researchers and clinicians alike. I have recounted the following examples elsewhere (Reynolds, 2008), but a brief retelling
seems appropriate here.
I monitor several professional psychology Listservs, mostly to respond to questions about testing and
assessment. One posting inquired about what intelligence test might be best for assessing intellectual function
in an adolescent with a drug addiction. Several suggestions followed, but there was a surprising level of
agreement with a post indicating that the psychologist could give any intelligence test, but the results would
be worthless because none of the tests had been normed specifically on adolescents with a drug addiction.
How such a normative sample of adolescents with a drug addiction would help answer the referral question about this young man’s current level of intellectual function was not considered by those who accepted this claim as a basic
truth. I have read textbooks that decry the fact that many tests used to diagnose disabilities have no norms
specific to these groups and argue against their use for this reason. How such norms might be used is not
addressed. Indeed, many tests do have specialized normative samples in addition to a set of norms represen-
tative of more general populations. Having a reference sample composed of individuals from a specific
diagnostic group can indeed be helpful, but each reference group comparison answers a different question
about the score of the person being assessed. In the above example, if one wanted to know “what is this
adolescent’s level of intellectual function relative to that of the population of age mates residing in the United
States,” a sample of adolescents with drug addiction is irrelevant. But if the question is “what is this
adolescent’s level of intellectual function relative to that of age mates with the same drug addiction issues who
also reside in the United States,” a different sample is required. The refinement of our research questions—
whether they are research on an individual patient or research on a construct or population of interest—is very
important, and our samples must match our questions. Otherwise, we can only answer the questions with
speculation.
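A brief sketch may make the point concrete. All of the numbers below are invented for illustration (they are not norms from any published test); the snippet simply shows that the same raw score earns a very different standard score depending on which reference sample the question calls for.

```python
def standard_score(raw: float, norm_mean: float, norm_sd: float,
                   scale_mean: float = 100.0, scale_sd: float = 15.0) -> float:
    """Convert a raw score to a deviation standard score against a chosen norm group."""
    z = (raw - norm_mean) / norm_sd
    return scale_mean + scale_sd * z

raw_score = 42.0                 # hypothetical raw score on some cognitive measure
national_norms = (50.0, 10.0)    # assumed mean and SD for age mates in the general population
clinical_norms = (40.0, 8.0)     # assumed mean and SD for age mates with the same addiction issues

print(f"vs. national norms: {standard_score(raw_score, *national_norms):.0f}")  # 88: below average
print(f"vs. clinical norms: {standard_score(raw_score, *clinical_norms):.0f}")  # 104: near average
```

Neither number is more "correct" than the other; each answers a different referral question, which is the point of matching the reference sample to the question being asked.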
The scale of measurement (nominal, ordinal, interval, or ratio) derived for an instrument also is commonly
misunderstood, especially with regard to its implications for what we can and cannot do with the numbers we
obtain. On several Listservs, and even in some state regulatory codes that govern diagnosis of disabilities or
eligibility for special programs, a specified “percentage” discrepancy in learning is indicated as a requirement for diagnosis. I have even heard this approach used in presentations of scientific papers at
more than one association meeting. Some might say, for example, that a student must have a 50% discrepancy
between an achievement test score and current grade placement. Such a percentage would be calculated on the
basis of the difference between a classically derived grade equivalent (GE) on the achievement measure and
current grade placement (GP) divided by the grade placement. Here we have an ordinal scale number (GE)
being subtracted from a scale with largely unknown properties (GP), producing a number with unknown characteristics. After that number is divided by GP, we are left with a number whose characteristics are even less well understood. That number is then used to classify an individual as disabled, or not, affecting access to
intervention programs, disability income, and other special programs! Typically there is no attempt to validate
such a procedure, and few psychologists understand why, from a measurement perspective, such mathematical
manipulations might be problematic to interpret.
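For readers unfamiliar with the procedure being criticized, a minimal sketch of the arithmetic (with invented values) may clarify why it is so easy to perform and so hard to defend.

```python
# Hypothetical values only: the calculation is trivial, but it subtracts an ordinal score (GE)
# from a quantity with largely unknown scale properties (GP) and then divides by GP, so the
# resulting "percentage" has no established measurement meaning.
grade_equivalent = 2.5   # GE from an achievement test (an ordinal, not equal-interval, score)
grade_placement = 5.0    # GP: the student's current grade placement

discrepancy = (grade_placement - grade_equivalent) / grade_placement
print(f"computed 'discrepancy': {discrepancy:.0%}")   # 50% -- but 50% of what, on what scale?
```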
I encounter both practice and research in which psychologists combine numbers derived from different
scales of measurement and draw wholly inappropriate conclusions (e.g., they conclude that a person with an
IQ = 90 is 50% more intelligent than a person with an IQ = 60 or argue that an IQ decline from 100 to 75 represents a 25% loss of intellectual function). They may know that such conclusions are erroneous but have
no clue why. Ordinal and interval data, the most common types of scores in psychology, cannot sensibly
withstand such calculations. Researchers, too, make computations from derivative indexes that mix scales of
measurement, clouding their interpretation.
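One way to see why such statements fail is to translate the same scores into percentile ranks. The sketch below assumes the conventional deviation IQ metric (mean 100, SD 15) and uses Python's statistics.NormalDist; it is an illustration of the scaling argument, not an analysis of any particular test.

```python
from statistics import NormalDist

iq_metric = NormalDist(mu=100, sigma=15)   # conventional deviation IQ scaling

for iq in (60, 75, 90, 100):
    percentile = iq_metric.cdf(iq) * 100
    print(f"IQ {iq:3d} -> roughly the {percentile:4.1f}th percentile")

# IQ 90 is 1.5 times IQ 60 on the score scale, yet the corresponding percentile ranks
# (about the 25th vs. the 0.4th) bear no 1.5:1 relation. Because deviation IQs have an
# arbitrary zero point, ratio and percentage statements about them are not interpretable.
```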
The language of testing also has changed dramatically in some circles of the profession but has been slow
to change in other arenas. The greatest changes have been in the vernacular of reliability and validity.
Psychological Assessment now requires that references to reliability, validity, and other psychometric terms
conform to the recommendations in the American Educational Research Association, American Psychological
Association, and National Council on Measurement in Education (1999) Standards for Educational and
Psychological Testing. The changes in wording in the 1999 Standards are far more than cosmetic and are
intended to reinforce and promote a change in thinking about these concepts that had been fomenting in the profession for several decades before coalescing into implementation. Psychological Assessment, being an American
Psychological Association journal, intends to promote these Standards and their language, as adopted by the
American Psychological Association. As an example, reliability refers to test scores, not tests, and validity
refers to the accuracy and appropriateness of test score interpretations—again, not to tests. Tests are frequently
referred to as reliable or valid, and this is now simply archaic; the continued shorthand provided by such
phrasing should cease. Consider the research question we often read in submitted manuscripts, “the purpose was to determine the validity of the XYZ test,” with no further specification. This question cannot be answered.
Consider the question, “is the Wechsler Adult Intelligence Scale (4th ed.; WAIS–IV) a valid test?” Again, the
question is inappropriate. Both researchers and clinicians ask such questions, but the questions must be
clarified. In this instance, a superior and acceptable question would be, “Is the WAIS–IV a valid measure of
intelligence for adults referred for clinical evaluation?” Validity must refer to a context and a construct, not
to a test, and is relevant to the interpretation of scores on a test (i.e., the attachment of meaning to performance
on a measuring device in psychology). An example from the physical sciences may help. If I possess a ruler
marked in inches certified by the Bureau of Standards and I ask, “Is that a valid ruler?” most would say yes.
Does that mean I can measure weight with it? At first one would laugh and say “no, no one would attempt
that; a ruler is to measure length, height, or distance in some way.” The answer even here is not so clear:
Weight can be determined with a ruler in special contexts. Suppose I know the weight per inch of a brick wall
but cannot move it onto a scale. I can measure its height and multiply by some known constant and determine
a weight that is quite “valid,” provided I have the validity evidence to back my interpretation. So even in the
physical realm, we can ask a superior question—“valid for what purpose?”—and look for evidence upon
which to base our answer.
In the area of psychological testing, such clarity is seldom achieved, which is why we must be even more
vigilant about our use of terminology. Interpretations of psychological tests that lack validity evidence are
often made under the rubric of “it is a valid test.” Each interpretation attributed to a test score must be
validated: Tests are neither valid nor invalid; only the interpretations of performance on the test are the proper
subject of validation research.
One particular approach to the validation of test scores as diagnostic indicators deserves additional comment
because of its frequency in unsuccessful articles. Often a contrasted-groups approach is used to determine
whether a measure can differentiate a diagnosis adequately. Too often, researchers ask whether a test for a
specific form of psychopathology can distinguish those who have it from those who do not—that is, a
comparison between a diagnosed sample with the disorder of interest and a typical sample of individuals. We
have learned over decades of research that such discriminations are relatively easy (i.e., distinguishing a
clinically diagnosed group from a typical nonclinical group can be done with most psychological tests and
represents no real advance). Rather, the more pertinent question, and the one that helps clinicians and
researchers the most, is whether test results can distinguish not only typical from atypical but also different
classes of psychopathology from one another. Tests whose results can be used to sort diagnostic groups into
the correct classifications with a high degree of accuracy are indeed useful and needed.
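A small, entirely hypothetical sketch may illustrate the gap between the two questions. The counts below are invented; the point is only that a measure can separate clinical from nonclinical cases perfectly while performing at chance when asked to sort clinical cases into the correct diagnostic class.

```python
# Hypothetical validation sample: 50 controls, 25 cases of "disorder A," 25 of "disorder B."
true_labels = ["control"] * 50 + ["disorder A"] * 25 + ["disorder B"] * 25
# Hypothetical test-based classifications: every clinical case is flagged as clinical,
# but the test assigns the specific diagnosis essentially at random.
pred_labels = (["control"] * 50
               + ["disorder A"] * 13 + ["disorder B"] * 12     # the true disorder A cases
               + ["disorder A"] * 13 + ["disorder B"] * 12)    # the true disorder B cases

# Question 1: clinical vs. nonclinical (the contrasted-groups comparison).
binary_correct = sum((t == "control") == (p == "control") for t, p in zip(true_labels, pred_labels))
print(f"clinical vs. control accuracy: {binary_correct / len(true_labels):.0%}")        # 100%

# Question 2: differential diagnosis among the clinical cases (the more useful question).
clinical_pairs = [(t, p) for t, p in zip(true_labels, pred_labels) if t != "control"]
diff_correct = sum(t == p for t, p in clinical_pairs)
print(f"differential-diagnosis accuracy: {diff_correct / len(clinical_pairs):.0%}")     # 50%
```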
There are many other reasons why papers submitted to Psychological Assessment do not get accepted, but
these methodological and conceptual issues seem to take a large share, and they are remediable if addressed
before research projects are started. We see far too many good ideas slaughtered by poor measurement
practices and rote design. Results from properly designed work have a greater likelihood of advancing research and improving practice. And as scientists, we should mourn the loss of good ideas because
of poor execution.
Perhaps I am mistaken, but the impetus for much of the error we see in rejected manuscripts appears to stem
from minimal course work or other formal training in measurement methods. As the knowledge explosion
continues in all sciences, and especially in the biological bases of behavior, and more demands are placed on
training programs, measurement seems to have been given less emphasis in doctoral programs in professional
psychology. Good assessment skills require sound measurement research and practices, as does good research
on the human condition.
One of the tasks of Psychological Assessment is to promote a strong science of clinical assessment as
practiced throughout professional psychology. To that end, I have attempted to pull together a more eclectic
group of Associate Editors and Consulting Editors than might be anticipated for a journal devoted to clinical
assessment. Our hope is to attract more and better manuscripts that deal with issues focusing on all aspects
of clinical assessment. Clinical assessment is growing in many areas of psychology, and clinical assessments
are used in more aspects of our society than ever before. We hope to reflect this diversity of work in the pages
of the journal. Psychological Assessment has, in the past, emphasized adult clinical assessment, and this will
remain a major focus of the journal. However, we hope to attract more manuscripts dealing with assessment
across the full lifespan. Linking clinical assessments to the biological bases of behavior and understanding the
likely transactional interactions between behavior and our biology are seen as important to the future of
clinical assessment as well. Clinical assessment practices are being applied to the prediction of the behavior
of groups, and clinical analytic methods are being used to study and to predict the course of terrorist activities,
and this work, too, will be welcomed. Federal government regulations regarding the determination of
disabilities, a major focus of many clinical practitioners in assessment, are being revamped conceptually as
well as practically, and research is urgently needed on the effectiveness of new methods and measures of
disability determination under such reconceptualizations. The development of methods of clinical assessment
and methods of research on clinical assessment of culturally diverse populations is another area we hope to
enhance over the years of our stewardship. The mobility and increasing diversity of the population of the
world require it if we are to continue to have psychological assessments that are useful.
The complexity of new models and demands on clinical assessment and the populations to which they must
be applied also lead us to stress to the research community that to break new ground, one must lay a strong
foundation for multidisciplinary assessment research. This foundation must include the careful development
of measurement tools in all domains of assessment and the in-depth understanding of measurement terms and
concepts when communicating the research findings in submissions to the journal.
I am honored to have been recommended and subsequently chosen to lead what I see as an outstanding
group of Associate Editors and Consulting Editors in this effort. Knowing them and their work, I am confident that the success of Psychological Assessment in promoting the advancement of clinical assessment in research and practice will stem from their efforts and that any failing will be mine for not taking full advantage of their talents and
dedication to the profession. With this issue, you will begin to see the earliest fruits of our collective efforts.
If you are involved in research with implications for clinical assessment, as broadly defined and practiced,
Psychological Assessment is interested in your work. Any and all recommendations for improvements to the
journal or our editorial process are welcome, as well, and can be sent directly to me.
Cecil R. Reynolds, Editor

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in
Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational
Research Association.
Reynolds, C. R. (2008, April). Has any real understanding of measurement gone missing from the professional psychology
curriculum? Score, 30(2), 3–4.

Received December 10, 2009
Revision received December 10, 2009
Accepted December 30, 2009
