

Neuropsychology Review, Vol. 14, No. 3, September 2004 (© 2004)

A Critique of Miller and Rohling's Statistical Interpretive
Method for Neuropsychological Test Data

Victor L. Willson¹ and Cecil R. Reynolds¹,²

A critical review of the 24-step procedure of Miller and Rohling's (in press) proposed standardization
of clinicians' use of neuropsychological assessment batteries is presented. Each step is examined
for statistical sources of invalidity. It was concluded that parts of the procedure are quite vulnerable
to between-battery variability that cannot be easily estimated or controlled, leading to significant
errors in analysis and classification. A second fatal flaw is the failure to distinguish in the procedures
between the standard error of measurement and the standard error of estimate in calculations at several
steps. The purpose of the process remains viable, however, and is an important contribution toward
the improvement of clinical diagnosis.
KEY WORDS: neuropsychology; assessment; critique.

The intent of Miller and Rohling (in press) to standardize the interpretation of the test scores resultant from a neuropsychological evaluation is laudable. The interplay between clinicians' expert knowledge and empirical test information has now had a 100-year history, from the early writings of Lightner Witmer (Witmer, 1902) and others (e.g., Pintner, 1923) onward. The concern for idiosyncratic interpretations of data that result in misdiagnosis leads Miller and Rohling to propose a 24-step procedure to evaluate test scores and produce consistent, defensible evidence for diagnosis that is intended to be highly reliable across diagnosticians. In particular, it is intended to be generalizable across test batteries that might be used in the neuropsychological evaluation. If successful, this method would indeed be robust. Our concern is in its potential for success and unwarranted levels of confidence in diagnoses deduced from this model.

Our approach here is to examine each step for its susceptibility to either error or its probability of error. Because the probability of error in any statistical procedure based on successive steps is the product of the probabilities of correct decisions at each step (although it may be possible for incorrect decisions at one stage to result, nevertheless, in correct decisions at some successive stage), we will attempt to determine at least in general form the probability of a correct decision based on the cumulative decisions at the steps proposed. Some steps will have no incorrect probability associated with them, which will be noted. Error from step to step is not simply additive but increases geometrically.

Step 1: Design and Administer a Test Battery. There is great potential for different diagnoses with different test batteries. There is no empirical evidence in the research literature that all test batteries of neuropsychological functioning provide test scores that estimate a set of cognitive functions identically. That is, for a test score x_ij,

X_ij = Σ_k λ_ijk F_k + e_ij,    (1)

where
i = test i of construct I,
λ = regression (factor) loading on construct k, from 1 to K (including construct I),
j = battery J selected for the patient.

It is improbable that all loadings on a particular kind of test (say digit span) are identical across test batteries. Note that we assume that for many tests the loadings are not zero for constructs other than the one principally intended. This conclusion affects subsequent steps.

Here there is an additional, nonstatistical point of disagreement. In choosing to administer a flexible battery, Miller and Rohling (p. 10) state "test selection should

¹ Texas A & M University, College Station, Texas.
² To whom correspondence should be addressed at Department of Educational Psychology, MS 4225, Texas A & M University, College Station, Texas 77843-4225; e-mail: crrh@earthlink.net

1040-7308/04/0900-0177/0 © 2004 Plenum Publishing Corporation

be based on knowledge of the literature and experience with the instruments." A few other guidelines, such as recency of norming, are offered; we agree with all of these rubrics. However, the ultimate argument for using a flexible versus a standard battery rests upon the argument that a flexible battery can be chosen to best assess the patient's presenting problem. Criticisms of standard batteries center around their potential insensitivity to the heterogeneity of the patient population seen and subsequent lack of diagnostic specificity or utility. Miller and Rohling, however, recommend picking a battery based upon examiner characteristics and test characteristics without regard for the presenting problem and other relevant patient characteristics. Following the recommendations of Miller and Rohling, each clinician may have a different battery, but it would be the same battery for all patients seen by a specific clinician. This seems to defeat the purpose of the flexible battery choice altogether. Each clinician then would have a unique but clinician-specific standard battery (that would of necessity vary as a function of age), which could in fact lead to more accurate diagnoses if clinicians maintain and monitor data on their own practices.

Step 2: Estimate Premorbid General Abilities. There are two problems with this step. First, the construct of general ability is not defined. In the absence of a meaningful definition, it appears that general ability is whatever the particular combination of test battery and demographic information available can be factor analyzed to produce. This vagueness will produce great variation in any construct score developed from it. Equation (1) is invoked here, except now the equation becomes

g_m = Σ X_ij + Σ b_1 d_1,    (2)

where ability construct g_m is the EPGA, the estimated premorbid general ability estimated from the available test scores and demographics, and b_1 is the regression loading for a particular demographic variable d_1. This equation clearly indicates that the ability construct estimate will depend on the test scores and particular demographics selected, and in turn on the particular emphases of the test battery loadings on neuropsychological constructs estimated by the test battery selected.

Miller and Rohling recommend that multiple measures be used to produce multiple estimates of premorbid general ability and that these estimates then be combined. Each of the estimates mentioned has a different latent construct with its own distribution. Converting them all to T scores and using the mean to estimate EPGA is mathematically flawed, since the underlying scales of measurement are not the same (e.g., rank in class is on an ordinal scale and NART scores are on an interval scale, and the two cannot be averaged in a way that makes sense). Miller and Rohling do not specify a means of aggregating these estimates into a single data point; we are relatively certain it is difficult to do accurately and that clinicians will vary greatly in the methods they apply. The choice of method and the variability in such choices across clinicians will both add unknown amounts of error to the equation that cannot at this point be estimated accurately. In using a single method, such as class rank, Miller and Rohling ignore regression effects in prediction of less than perfectly correlated variables. This issue also complicates their goal of consistency across clinicians, since many different choices of procedures and ways of combining them will be made without strict guidance.

Step 3: Convert All Scores to a Common Metric. The linear transformation of the test scores does not change the variability in estimation of underlying neuropsychological constructs. However, we feel compelled to elaborate on Miller and Rohling's discussion of their preference for normalization of scores. They leave it open to clinicians to normalize the T-score transformations or to use a linear transformation. To the extent that clinicians make different choices, this variance or unreliability in choosing which transformation to apply (linear or normalization) will add inestimable error to the diagnostic process and adds further to the point above that consistency across clinicians cannot be achieved following their recommendations.

Miller and Rohling provide some discussion of when to choose normalization of score distributions and when not to do so. Our preference is for linear transformations in the case of well-designed standardization samples, and we recommend that only well-standardized tests in fact be used (thus negating the debate between normalized and linear T scores).

On page 13, the authors note that "when normative data are highly skewed . . . appropriate transformations should be conducted to normalize the distribution prior to transforming the scores into T scores." It is only appropriate to normalize a distribution of scores if the lack of normality within the score distribution in question is due to sampling error. The assumption underlying transformation to a normalized distribution is that the underlying distribution in question is in fact normal and that sampling errors have produced the lack of normality in the attained distribution. If the attained distribution is truly representative of the actual distribution in question, converting or normalizing the distribution will introduce additional error into the process. This is often overlooked by researchers who desire to apply methods that require normal distributions of scores. The whole concept of normalizing distributions was developed to correct for sampling error. It is a widely held misconception that behavioral and cognitive and emotional variables are normally distributed (e.g., see discussions in Reynolds and Kamphaus, 1992,

2002). Many are and others approximate normality. In these instances, converting score distributions to a normal curve does not adversely impact the data. However, if the latent distribution is in fact nonnormal, and many such psychological variables are not normally distributed, normalization of the distribution is inappropriate and will lead to invalid inferences or additional error in the inferential process.

Step 4: Assign Each Test Score to a Cognitive Domain. This step has a high degree of uncertainty in that there is no evidence for the interclinician reliability of classifications of test scores. Although it might reasonably be expected to be moderately high, the error exists in the clinician selection of the number of domains and in the classification of a given test x_ij to that domain. To the extent that clinicians do not all classify tests in the same way, the errors associated with the estimation of the construct scores cumulate.

Miller and Rohling seem to recognize the problem of subjectivity of test score classification and attempt to avoid it by noting that each score is assigned to the cognitive domain for which prior published factor analytic research has found the highest loadings. This assumes that all domains of cognitive function are known, or at least that there is agreement on these. There is not. Classification of test scores by domains within the broader domains of cognitive versus noncognitive, even in the face of strong theory and empirical data, is difficult. There are many discussions of the subjectivity of factor naming, for example, in factor analysis (e.g., Cattell, 1978). Many people argue over the classification of memory versus attention, and even some well-known researchers (e.g., Jensen, Kaufman, Horn, and the like) are sharply divided over whether such a commonly used test as Coding from the Wechsler scales is a cognitive task that should be characterized as a memory measure, or a very low level measure of psychomotor speed, or is better classified as a measure of attentional skill. There are simply far too many legitimate disagreements in the field to allow such a simplification of these classifications. Also, Kaufman (1978, 1994) has argued cogently that observation of the patient while taking a test such as Coding or certain other Wechsler tasks is necessary to determine the appropriate classification of a task for an individual patient. Thus, a single task may be assigned to different domains (and accurately so) for particular patients based upon their approach to the task. It is interesting to see how Miller and Rohling have classified Trail Making and certain other tasks to differentiate them from attention and concentration and processing speed, also assuming that processing speed is cognate to psychomotor speed. Many would debate this. All of this goes to address the complexity of this issue and the simplicity of the authors' response.

Secondly, Miller and Rohling assume that a definitive factor analysis exists for all batteries considered. The outcome of a factor analysis, as expressed in the number of factors extracted and the correlation of each variable (test) with the factor, is a function of the variables in the specific analysis. The number of factors extracted and the relative loadings for a set of tests may be altered dramatically by the addition or deletion of several of the tests. No true meta-analysis that matches factors and relative loadings on common constructs could be located. Such an analysis would be necessary to have a strong empirical basis for assigning tests to domains, but even then, assignments may need to vary for individual patients (see Kaufman, 1978, 1994, for additional clinical explanations). The unreliability of the process at this stage augments the error component of Miller and Rohling's procedure. Granted, this would be a major undertaking, but the methods are available if not the time or the funding (e.g., see Cattell, 1978).

Step 5: Calculate Domain Means, Standard Deviations, and Sample Sizes. Determining the mean and standard deviation of each domain is not as simple as the authors present. It is not possible to calculate the standard deviation of a domain score unless one knows the correlation between each part score and the total score or the covariance matrix is available. Given that the examiner selects a wide-ranging number of tests and then combines them into a domain score, the number of possible standard deviations to be derived is substantial. One would have to have a table of intercorrelations of all of the potential tasks within a domain in order to calculate the standard deviation correctly. This is crucial because it affects the score distribution itself. One can find more detailed information on this problem in published reviews of the fourth revision of the Stanford–Binet scales, which also failed to incorporate this problem into its norms tables and for which it was soundly criticized by the measurement community. Some of the earliest measurement texts, such as Guilford's psychometric methods text from the 1950s, deal with this issue in some detail. There are also a variety of formulas in the literature that allow one to calculate the standard deviation of a composite if one only has the composite score and the correlation matrix. This would not be needed if one were norming a composite and had the complete normative data, but then one would also have the correlation matrix. In Step 6, the test battery means are discussed, but one cannot make sense of these without having the standard deviations correct, because the underlying metric is in fact unknown and cannot be assumed to be the standard T-score SD of 10.

Step 6: Calculate Test Battery Means. No error is added, but unless the standard deviations are correct, later applications of these means may be difficult to interpret.
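The composite-SD problem described in Step 5 can be made concrete with a small numerical sketch. The numbers below are our own hypothetical illustration, not Miller and Rohling's: three tests in one domain, each on a T-score metric (SD = 10), with assumed intercorrelations of .60, averaged into a domain score.

```python
import numpy as np

# Hypothetical domain of three T-score tests (SD = 10 each) with assumed
# intercorrelations of .60; the domain score is the mean of the three.
sds = np.array([10.0, 10.0, 10.0])
corr = np.array([[1.0, 0.6, 0.6],
                 [0.6, 1.0, 0.6],
                 [0.6, 0.6, 1.0]])

# Covariance matrix of the tests, then the variance of their mean:
# Var(mean) = w' C w with equal weights w = 1/3.
cov = corr * np.outer(sds, sds)
w = np.full(3, 1 / 3)
domain_sd = np.sqrt(w @ cov @ w)

print(round(domain_sd, 2))  # 8.56, not the 10 one might assume
```

With lower intercorrelations the domain SD shrinks further (at r = 0 it would be 10/√3 ≈ 5.77), which is why a table of intercorrelations, or the covariance matrix itself, is needed before the standard deviation of a domain score can be calculated correctly.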

Step 7: Calculate Probabilities of Heterogeneity. The probabilities for the computed chi-square statistics are highly sensitive to the errors from Step 4. The level of imprecision in classification and domain selection now becomes important because of the attempt to determine chi-square probabilities.

Step 8: Determine Categories of Impairment. This step is affected by cumulative error through Step 4 also. The resultant errors of category determination depend on the amount of cumulative error in Step 4 and are aggravated by Step 5 error.

Step 9: Determine the Percentage of Test Scores That Fall in the Impaired Range. The number of tests that are classified into the impaired range will depend in part on the cumulative errors up through Step 5 and on the degree to which the underlying metrics of the domain scores and the individual test scores are accurately estimated (i.e., the degree of deviation of the estimates from the true SDs). The reliability of classification is an unknown function of this cumulative error.

Step 10: Calculate Effect Sizes. The computations of effect size depend on the estimate of the EPGA, which has been shown to depend on the test batteries and demographics selected, and on the problems of selecting a method of estimation.

Step 11: Calculate Confidence Intervals. The statistical thinking here is either flawed or incomplete. Standard errors of measurement depend on the estimate of reliability of the score being considered. The formula presented by Miller and Rohling is incorrect. They provide an estimate of the standard error of estimate, not of measurement. The interpretation of these two is quite different, which affects later steps. Because no evidence for reliabilities for many of the computed measures is presented by Miller and Rohling, nor are they likely to be easily estimated without extensive empirical research and simulation, standard errors of measurement are largely unknown. Standard errors of estimate will in general be smaller than standard errors of measurement when

n > 1/(1 − reliability).

Thus, if the reliability of EPGA is .8, then if n > 5, the standard error of estimate will be smaller than the standard error of measurement. This will produce artificially small confidence intervals for test scores and give greater credence to the accuracy of the results than is warranted.

Step 10 indicates that the confidence intervals will usually be smaller than is warranted by the reliability of the data.

Step 12: Determine the Upper Limit Necessary for Premorbid Performance. The results of Step 11 will tend to produce estimates of maximum level of premorbid performance that most likely are too low. Because the standard error of estimate will be smaller, often much smaller, than the standard error of measurement for an individual's estimated score, the upper (or lower) limit of the confidence interval will be insufficiently extreme in correctly estimating the upper limit of premorbid performance. Then, even with no impairment, there would be a substantial proportion of scores beyond the limit that would occur within the confidence interval of the standard error of measurement. This would result in overclassification of an individual's actual performance as outside premorbid confidence intervals (loss of ability).

Step 13: Conduct One Sample t Tests on Each Type of Mean Generated. The direction here is too vague to be certain of what is intended. Because t statistics for the difference from a mean depend on the standard deviation and the sample size, the authors need to make this explicit. The results of these analyses depend on the Step 2 EPGA estimate, for which no standard deviation has been computed to this step, or if T scores are used, on the assumption of adequate estimates of any intermediate statistics. In principle, however, this is not as troublesome as the confidence interval computations in Step 11. Nevertheless, there is likely to be considerable cross-battery error in the t statistics and their probabilities. The lack of specific direction, however, will result in inconsistencies across clinicians, defeating the purpose once again of the entire procedure.

Step 14: Conduct a Between-Subjects Analysis of Variance with Domain Means. As was noted in Step 4, domain means will depend on battery selection, so the ANOVA results will also vary accordingly.

Step 15: Conduct a Power Analysis. The results of this step depend on the effect size computations, which depend on the EPGA estimate, with resultant uncertainty and error.

Step 16: Sort Test Scores in Ascending Order. This step is highly susceptible to random differences in scores, even if no between-battery error were present. It is quite problematic to interpret the ordinal position, which loses the important distance information.

Step 17: Graphically Display the Summary Statistics. Of concern is the overinterpretation of mean score variation by clinicians. Without standard errors of measurement, for example, too much precision may be attached to means and associated t statistics.

Step 18: Determine the Test Battery's Validity. Although much of the discussion in this step is valid, the continued reliance on sample size rather than reliability in score interpretation is invalid.

Step 19: The procedures discussed here do not increase error, but are expected to be affected by the errors cumulated in earlier steps.

Step 20: Use Test Battery Means to Determine if Impairment Exists. Again, the reliability of the classification depends on the selection of batteries in Step 1 and the accuracy of the estimation of the latent score distributions within domain.

Step 21: Determine Current Strengths and Weaknesses. This is problematic because the CIs (confidence intervals) are overprecise estimates of patient variability. The general procedure is not necessarily at fault if appropriate standard errors of measurement are applied.

Step 22: Examine the Test Scores From Noncognitive Domains. It is unclear what effects will accrue from this step beyond adding to the inconsistency across clinicians, due to the lack of guidance in how to conduct and interpret this analysis.

Step 23: Explore Low Power Comparisons for Type II Errors. This whole step has misplaced precision and understanding of the variability in patient test battery scores, stemming mostly from the confusion between standard error of estimate and standard error of measurement.

Step 24: Examine the Response Operating Characteristics of Numerically Sorted Scores. This is an overapplication of a highly technical procedure. Although it may function adequately if all clinicians use the same battery, there is no attempt to determine false positives within- or between-batteries, raising another flag for error in interpretation.

Our conclusion is simple. Parts of the procedure are susceptible to between-battery variability that has not been adequately evaluated for reliability or fidelity of analysis and classification. Other steps founder on a misinterpreted concept of variability based on standard error of estimate rather than standard error of measurement. This is a fatal flaw in our opinion and must be revised before the steps can even be considered within-battery for patient evaluation. We were not able to achieve our earliest goal of providing an accurate estimation of the error components of the model because too many components remain subjective and the error components are inestimable. Accurate estimation of the total error of the model will be central to its application and is necessary for its full evaluation, but as of yet cannot be accomplished.

We fear that clinicians, who will themselves introduce inconsistencies and additional error into the process for reasons described above, will, by virtue of the apparent statistical sophistication of the Miller and Rohling technique, imbue the results with greater confidence than is due. The proper level of confidence cannot be defined precisely due to the noted difficulties in estimating the error components of the model, but confidence cannot be characterized as high at this point, given the error components identified, the many points of subjective decision making inherent in the model, and differences in how clinicians will apply certain steps where choices are given. Neither, however, have the results been studied for validity or reliability, and such studies are necessary before the method is applied clinically. However critical we seem of the precise methods extolled by Miller and Rohling, we continue to laud these authors for their intent despite finding fault with their method of implementation, and trust their work will stimulate more and improved thinking in this arena by researchers and clinicians alike. This alone makes it a worthwhile contribution. The solutions to these problems are complex, and Miller and Rohling have stimulated what we predict will be a major line of research in the discipline. Their work certainly has helped us to understand better the issues to consider and the complex interplay of these variables in diagnosis and classification of patient performance on any set of measures, but particularly those that are not conormed.

REFERENCES

Cattell, R. B. (1978). The Scientific Use of Factor Analysis in the Behavioral and Life Sciences, Plenum, New York.
Kaufman, A. S. (1978). Intelligent Testing With the WISC-R, Wiley Interscience, New York.
Kaufman, A. S. (1994). Intelligent Testing With the WISC-III, Wiley Interscience, New York.
Miller, L. S., and Rohling, M. L. (in press). A statistical interpretive method for neuropsychological tests. Neuropsychol. Rev.
Pintner, R. (1923). Intelligence Testing: Methods and Results, Henry Holt, New York.
Reynolds, C. R., and Kamphaus, R. W. (1992). Behavior Assessment System for Children, American Guidance Service, Circle Pines, MN.
Reynolds, C. R., and Kamphaus, R. W. (2002). Clinician's Guide to the Behavior Assessment System for Children, Guilford, New York.
Witmer, L. (1902). Analytical Psychology, Ginn & Company, Boston.
