Abstract
Objective. Serial cognitive assessments are useful for many purposes, such as monitoring cognitive decline or evaluating the result
of an intervention. In order to determine if an observed change is reliable and meaningful, longitudinal reference data from non-
clinical samples are needed. Since neuropsychological outcomes are affected by language and cultural background, cognitive
tests should be adapted, and country-based norms collected. The lack of cross-sectional normative data for the Spanish population
has been partially remedied, but there is still a need for reliable change norms. This paper aims to give an initial response to this
need by providing several reliable change indices (RCI) for 1-year follow-up in a Spanish sample.
Method. A longitudinal observational study was designed. A total of 122 healthy subjects over age 50 were evaluated twice
(inter-visit interval: M = 369.5, SD = 10.7 days) with the NEURONORMA battery. Score changes were analyzed, and simple
discrepancy scores, standard deviation indices, RCI, and standardized regression-based scores were calculated.
Results. Significant improvements were observed in variables related to memory, both verbal and visual, visuospatial function,
and the completion time of complex problems. Reference tables for several RCI are provided for their use in clinical settings.
Conclusions. Our results confirm the existence of heterogeneous practice effects after 1 year, and support the recommendation of
using reliable change norms to avoid misdiagnosis in repeated assessments. This study provides initial, preliminary norms of
cognitive change for use in Spanish elders. Further studies on larger samples and different inter-visit intervals are still needed.
Introduction
Performing serial assessments is a common practice in neuropsychology. Repeated evaluations allow clinicians to accomplish
several objectives, such as following the natural progression of a disease or determining the effects of interventions. In geriatric
†
See appendix for Members of the NEURONORMA Study Team.
© The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
doi:10.1093/arclin/acw018 Advance Access publication on 24 April 2016
G. Sánchez-Benavides et al. / Archives of Clinical Neuropsychology 31 (2016); 378–388 379
populations, cognitive follow-up permits, for example, the tracking of cognitive decline of patients suffering from dementia or the
detection of its initial symptoms by following subjects at risk. However, interpreting longitudinal data poses some challenges,
such as dealing with practice effects, or with cognitive changes associated with aging (Duff, 2012; McCaffrey, Duff, &
Westervelt, 2000). The cross-sectional normative approach, which is the most common way of evaluating whether a given score can
be considered indicative of impairment, is not suitable for deciding whether a cognitive change is normal or not. To illustrate
this idea, imagine a highly educated 70-year-old male who begins to complain about his memory. After a neuropsychological
evaluation, his age- and education-adjusted performance in a list-learning memory test is 1 SD above his reference group. As a
Participants
The data used to develop the reliable change norms presented in this study come from the 1-year follow-up visit performed in a
subgroup of healthy individuals from the NEURONORMA study normative sample. The NEURONORMA battery, which is a
comprehensive neuropsychological battery composed of 14 tests (cf. infra), was initially administered to 356 cognitively
normal subjects aged between 50 and 85 years old to obtain Spanish age- and education-adjusted normative data. Nine hospitals
from different regions of Spain were involved in data collection to ensure the maximum representativeness of the population.
Subjects were recruited between 2004 and 2007. The recruitment methods and sample characteristics were extensively described
elsewhere (Peña-Casanova, Blesa, et al., 2009; Sánchez-Benavides et al., 2014). One hundred and twenty-four individuals were
reassessed 1 year after baseline (mean interval 369.5 days, SD = 10.7). Of those 124 subjects, two were diagnosed with mild
cognitive impairment (MCI) at follow-up by a clinician blinded to the study findings and were excluded from the analyses. Although
the final analyzed sample includes 122 individuals, not all subjects completed all the tests and the number of subjects providing
data for each test varies from 103 to 119. Sociodemographic characteristics and screening outcomes at baseline can be seen in
Table 1. The study was approved by the Research Ethics Committees of the involved centers and was conducted in accordance
with the Declaration of Helsinki and its subsequent amendments. All subjects gave signed informed consent before any study
procedure was performed.
Cognitive Measures
The NEURONORMA battery is composed of the following tests: Digit Span Forward and Backward; Visuospatial Span from
the WAIS-R-NI (Corsi’s Test); Trail Making Test (TMT); Symbol Digit Modalities Test (SDMT); Boston Naming Test (BNT);
Token Test; Selected subtests of the Visual Object and Space Perception Battery (VOSP): Object Decision, Progressive
Silhouettes, Position Discrimination, and Number Localization; Judgment of Line Orientation (JLO); Rey–Osterrieth
Complex Figure (ROCF), copy and memory at 3 and 30 min; Free and Cued Selective Reminding Test (FCSRT); Verbal
Statistical Analyses
Initially, a descriptive analysis of the sociodemographic and cognitive outcomes at baseline (T1) and follow-up (T2) was per-
formed. Paired t-tests were used to test for significant changes in scores between visits. Pearson correlations and Cohen’s d effect
size were also computed. Five reliable change scores were calculated following previous work on the topic (Duff, 2012, 2014):
simple discrepancy scores, standard deviation indices (SDI), classical RCI, RCI correcting for practice effects, and complex SRB
scores. For SRB, baseline scores, age, education, and sex were used to predict the 1-year score. No simple SRB scores (i.e., using only
the baseline score) were computed because sociodemographic data are usually accessible to clinicians and can be easily included in
complex SRB calculations.
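The descriptive step described above (paired t-test, test-retest correlation, and effect size) can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name `retest_summary` and the six score pairs are hypothetical, and Cohen's d is assumed here to be the mean change divided by the baseline SD, which is one common convention that the text does not confirm.

```python
import math
from statistics import mean, stdev


def retest_summary(t1, t2):
    """Summarize paired baseline (t1) and follow-up (t2) scores.

    Returns the mean change, the paired t statistic (df = n - 1),
    the Pearson test-retest correlation, and Cohen's d.
    """
    if len(t1) != len(t2) or len(t1) < 3:
        raise ValueError("t1 and t2 must be paired and have at least 3 scores")
    n = len(t1)
    diffs = [b - a for a, b in zip(t1, t2)]
    mean_change = mean(diffs)
    # Paired t: mean difference over its standard error.
    t_stat = mean_change / (stdev(diffs) / math.sqrt(n))
    # Pearson r from the sample covariance and sample SDs.
    m1, m2 = mean(t1), mean(t2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(t1, t2)) / (n - 1)
    r12 = cov / (stdev(t1) * stdev(t2))
    # ASSUMPTION: d = mean change / baseline SD (not stated in the text).
    cohen_d = mean_change / stdev(t1)
    return {"mean_change": mean_change, "t": t_stat, "df": n - 1,
            "r": r12, "d": cohen_d}


# Hypothetical scores for six subjects, for illustration only.
baseline = [5, 6, 4, 7, 5, 6]
follow_up = [6, 6, 5, 7, 5, 7]
summary = retest_summary(baseline, follow_up)
```

The p-value is deliberately left out because the standard library has no t distribution; a statistics package would be needed for that step.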
Results
Table 1 shows the baseline characteristics of the sample. Subjects’ mean age was 65 years and the mean length of formal
education was 11 years, which approximately corresponds to ISCED-UNESCO level 2. Men were underrepresented in this follow-up
sample (31%); this percentage is lower than that observed in the overall baseline sample (n = 356), which comprised 40.4%
men. There were no significant differences between visits in depressive symptoms, as measured by the
HDRS score [T1, M = 2.7, SD = 2.8; T2, M = 3.0, SD = 3.2; t(117) = −0.14, p = .298], with minimal mean variation
(T2 − T1, M = 0.3, SD = 3.2). Table 2 summarizes the baseline and follow-up cognitive scores in the NEURONORMA
battery. Correlation values between T1 and T2 scores, along with effect sizes (Cohen’s d), are also shown. Global performance
tended to be better at follow-up. These improvements were statistically significant in 11 of the 33 variables studied:
Judgment of Line Orientation, t(109) = 2.02, p = .046; FCSRT Free recall trial 1, t(106) = 4.04, p < .001; FCSRT Total
free recall, t(106) = 4.27, p < .001; FCSRT Total recall, t(106) = 3.45, p < .001; FCSRT Free delayed recall, t(104) =
2.49, p = .014; FCSRT Total delayed recall, t(104) = 2.35, p = .020; ROCF 30 min recall, t(106) = 2.16, p = .032; TOL-Dx
Total execution time, t(111) = −3.19, p = .002; TOL-Dx Total solving time, t(111) = −3.13, p = .002; VOSP Object decision,
t(110) = 5.56, p < .001; VOSP Progressive silhouettes, t(110) = −9.11, p < .001.
Relevant percentiles (2%, 5%, 16%, 50%, 84%, 95%, 98%) obtained from the calculations of simple discrepancy scores (T2 − T1)
are displayed in Table 3. Medians are in most cases around 0 (i.e., no score change), except for memory measures and some timed
tasks, in which percentile 50 was associated with score improvements. These data can be easily used in clinical settings by finding
the position in the percentile distribution in which the patient’s discrepancy score falls. As usually interpreted, the percentile, or the
Table 2. Test and retest cognitive scores, practice effects, and test-retest correlations
Test                    n     T1, M (SD)       T2, M (SD)       T2 − T1, M (SD)   r      Cohen’s d
Digit Span Forward      119   5.46 (1.10)      5.58 (1.09)      0.12 (1.01)       0.58   0.11
Digit Span Backward     119   3.92 (1.04)      4.01 (1.09)      0.09 (0.92)       0.62   0.09
Corsi’s Test Forward    113   4.05 (1.02)      4.12 (0.92)      0.06 (1.02)       0.50   0.06
Corsi’s Test Backward   113   3.41 (1.04)      3.50 (1.01)      0.09 (0.94)       0.50   0.09
TMT Part A              112   57.62 (27.79)    58.51 (26.49)    0.88 (21.81)      0.68   0.03
range of percentiles, associated with the raw score gives an idea about the likelihood of observing this degree of discrepancy in healthy
subjects. For example, for the TOL-Total moves, half of the sample (percentile 50) needs at least three fewer moves to solve the
problems in the 1-year follow-up. Needing 25 extra moves at follow-up would fall within the 2–5 percentile range (Table 3), which can
lead to a clinical interpretation regarding the abnormality of such decline in healthy people over 50 years old.
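The percentile lookup described above can be sketched in a few lines. This is only an illustration of the procedure: the function name `percentile_band` and the cutoff values are hypothetical placeholders, not values from Table 3, and the real cutoffs for each test must be taken from the published table.

```python
from bisect import bisect_right

# Percentiles reported for the simple discrepancy scores (T2 - T1).
PERCENTILES = (2, 5, 16, 50, 84, 95, 98)


def percentile_band(change, cutoffs):
    """Locate a discrepancy score within the reference percentile bands.

    `cutoffs` holds the change score at each entry of PERCENTILES, in
    ascending order. Returns (lower, upper) percentiles; None marks an
    open-ended band below the 2nd or at/above the 98th percentile.
    """
    if len(cutoffs) != len(PERCENTILES) or list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be ascending, one per percentile")
    i = bisect_right(cutoffs, change)  # number of cutoffs <= change
    lower = PERCENTILES[i - 1] if i > 0 else None
    upper = PERCENTILES[i] if i < len(PERCENTILES) else None
    return lower, upper


# HYPOTHETICAL cutoffs for illustration only (not taken from Table 3).
example_cutoffs = [-3, -2, -1, 0, 1, 2, 3]
band = percentile_band(-2.5, example_cutoffs)  # falls between the 2nd and 5th percentiles
```

For inverse variables, where higher scores mean worse performance, a decline appears at the upper end of the distribution, so the band has to be read in the opposite direction.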
Table 4 shows SDI and RCI indices. Although these indices are calculated using different formulae, their interpretation is
homogeneous. Change is in all cases divided by some measure of standard deviation and produces z-scores that can be easily
interpreted and compared. SDI standardizes the discrepancy score (T2 − T1) by dividing it by the SD at T1. For its part, RCI uses the
standard error of the difference (SED) in the denominator, which is an estimate of the standard deviation of the difference score. In the
RCI correcting for practice effects, the group mean of change is subtracted from the individual discrepancy score before dividing it
by the SED. This procedure “centers” the individual change at the mean normal variation. Finally, Table 5 shows the results of the
SRB. In the SRB method, a predicted T2 score is calculated from the subject’s baseline score and the sociodemographic
predictors that were significant in the regression analyses. Then, the predicted score is subtracted from the subject’s actual score and
divided by the standard error of the estimate (SEE) of the regression equation. The result can then be interpreted just as the previous
indices, by treating it as a regular z-score. The reader should note that for inverse variables (i.e., those in which the higher the
score, the higher the impairment: TMT, ROCF-Copy Time, TOL-Dx Total Moves, TOL-Dx Initiation, Execution and Solving
Times, and VOSP Progressive Silhouettes), the interpretation of the obtained z-score should be reversed (i.e., positive z-scores
mean decline at follow-up).
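The three standardized indices described above can be sketched for Digit Span Forward, using the values reported in Tables 2 and 4 (baseline SD 1.10, test-retest r 0.58, mean change 0.12). The SED formula below, SD at T1 multiplied by the square root of 2(1 − r), is an assumption: it is the classical form, and it reproduces the denominators shown in Table 4 (1.01 for Digit Span Forward, 22.23 for TMT Part A), but the text does not state it explicitly. The function names and the example subject are hypothetical.

```python
import math


def sed_from_reliability(sd_t1, r12):
    """Standard error of the difference.

    ASSUMPTION: SED = SD_T1 * sqrt(2 * (1 - r12)), the classical form.
    """
    if sd_t1 <= 0 or not -1.0 <= r12 < 1.0:
        raise ValueError("need sd_t1 > 0 and -1 <= r12 < 1")
    return sd_t1 * math.sqrt(2.0 * (1.0 - r12))


def change_indices(t1, t2, sd_t1, r12, mean_change):
    """Return SDI, classical RCI, and practice-corrected RCI as z-scores."""
    change = t2 - t1
    sed = sed_from_reliability(sd_t1, r12)
    return {
        "sdi": change / sd_t1,                  # change over baseline SD
        "rci": change / sed,                    # change over SED
        "rci_pe": (change - mean_change) / sed  # centered on mean group change
    }


# Digit Span Forward reference values from Tables 2 and 4.
DSF_SD_T1, DSF_R12, DSF_MEAN_CHANGE = 1.10, 0.58, 0.12

# Hypothetical subject: span of 6 at baseline and 4 at follow-up.
example = change_indices(6, 4, DSF_SD_T1, DSF_R12, DSF_MEAN_CHANGE)
```

All three values are negative for this subject, and the practice-corrected RCI is the most extreme because the expected small gain is removed before standardizing.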
Discussion
This paper provides reliable change reference data for several measures from a sample of healthy Spanish subjects between 50
and 85 years old. These data can be used to determine if a subject’s 1-year cognitive change is meaningful or not. Percentiles for
simple discrepancy scores, SDI, RCI, and SRB indices for widely used neuropsychological tests are given.
We have found a global trend toward improvement in cognitive performance at follow-up, indicating the presence of practice
effects in most tests after 1 year (Table 2). This finding is not surprising, since 1-year practice effects are commonly reported in
healthy samples (e.g., Calamia et al., 2012; Jonaitis et al., 2015; Levine, Miller, Becker, Selnes, & Cohen, 2004). Despite this
global positive trend, only a few variables reached significance in paired t-tests. Significant improvements were observed in
variables related to memory, both verbal (FCSRT) and visual (ROCF), visuospatial function (JLO, VOSP Object Decision and
Progressive Silhouettes), and the completion time of complex problems (TOL-Dx Total Execution Time, and Total Solving
Time). Regarding memory tasks, when no alternative versions are used, the examinee would learn both the procedure of the
task and the specific materials to be remembered, and consequently larger practice effects are expected. In our sample, practice
effects in memory tasks showed heterogeneous effect sizes, ranging from 0.13 in the delayed recall of the ROCF to
0.40 in the Free Recall Trial 1 of the FCSRT. Previous reports found a high degree of heterogeneity in practice effects for tasks
within the same cognitive domain (Calamia et al., 2012). Such heterogeneity is likely related to specific test-associated
traits that modulate learning at baseline. Heterogeneity can also be observed in the visuospatial data. While the
spatial perception subtests of the VOSP (i.e., Position Discrimination and Number Localization) showed minimal improvement
Test                    SDI                 RCI                 RCI corrected for practice effects
Digit Span Forward      (T2 − T1)/1.10      (T2 − T1)/1.01      [(T2 − T1) − 0.12]/1.01
Digit Span Backward     (T2 − T1)/1.04      (T2 − T1)/0.91      [(T2 − T1) − 0.09]/0.91
Corsi’s Test Forward    (T2 − T1)/1.02      (T2 − T1)/1.02      [(T2 − T1) − 0.06]/1.02
Corsi’s Test Backward   (T2 − T1)/1.04      (T2 − T1)/1.04      [(T2 − T1) − 0.09]/1.04
TMT Part A              (T2 − T1)/27.79     (T2 − T1)/22.23     [(T2 − T1) + 0.88]/22.23
after 1 year, the JLO test, which is thought to tap the same underlying cognitive domain, displayed a significant improvement. The
other two VOSP subtests, which assess object recognition (i.e., Object Decision and Progressive Silhouettes), showed larger
practice effects, with Progressive Silhouettes showing the largest effect size in the battery (−0.85). In Progressive Silhouettes, the
examinee must recognize as early as possible an object that is initially presented from a non-prototypical point of view. In
subsequent drawings (10 per item), the silhouettes progressively reveal more details of the object by presenting it in a more
informative view, rotated toward its elongated axis. There are two objects to be recognized. Because of its nature, this recognition task is
clearly influenced by previous exposure to the test, and individuals can recognize the objects with less visual information in a
second assessment because they remember the object. There are other neuropsychological tests, not studied here, that are even
more prone to show large practice effects, such as the Wisconsin Card Sorting Test, which has been labeled a “one-shot test”
(Calamia et al., 2012) because the main measures of the test are compromised by previous exposure. In fact, the novelty of the
task seems to be one of the most relevant factors in the magnitude of practice effects: the more novel the task, the greater the practice
effect (Cysique et al., 2011; Dikmen, Heaton, Grant, & Temkin, 1999). In the current study, the TOL-Dx could be labeled as
the most “novel” task of the battery, and the significant decrements in the time needed to solve the problems can be interpreted
under this novelty effect assumption.
Negligible practice effects (operationalized as an arbitrary absolute Cohen’s d value below 0.1) were found for many variables.
The absence of significant improvements in these variables can be explained in terms of stability in memory-free variables [e.g.,
accuracy in constructional praxis (copy of the ROCF)], lack of novelty of the task [e.g., recalling vocabulary (BNT) or following
commands (Token Test)], or minimum learning of the specific items in “performance time-based” executive tests (e.g., the TMT,
SDMT, and Stroop test). However, these results can also be explained in part from an aging perspective. Previous studies reported
Test                    F (df)            R²     SEE     Predicted T2 score
Digit Span Forward      36.38 (2,116)     0.38   0.86    2.52 + T1 × 0.46 + edu × 0.05
Digit Span Backward     38.97 (3,115)     0.49   0.77    3.29 + T1 × 0.47 − age × 0.03 + edu × 0.05
Corsi’s Test Forward    24.23 (2,110)     0.29   0.77    4.10 + T1 × 0.42 − age × 0.03
Corsi’s Test Backward   15.95 (3,109)     0.29   0.85    3.00 + T1 × 0.36 − age × 0.02 + edu × 0.03
TMT Part A              50.96 (2,109)     0.47   19.22   35.1 + T1 × 0.56 − edu × 0.84
conflicting findings regarding practice effects on the studied tests, which seem to be highly influenced by participants’ age.
While young, well-educated samples obtain a consistent benefit from previous exposure (e.g., Attix et al., 2009; Estevis, Basso,
& Combs, 2012; Levine et al., 2004; Salinsky, Storzbach, Dodrill, & Binder, 2001), older samples show a much smaller practice
effect at retest (Calamia et al., 2012; Gavett, Ashendorf, & Gurnani, 2015). A relevant question that needs further research
is whether reliable changes show heteroscedasticity between different age groups. In that case, norms for
change should be developed separately for different age bands to account for such behavior, and larger samples would be
needed. In addition, the relationship between reliable change and age largely depends upon the specific traits of the task and
the inter-visit interval. As suggested by our results on the TMT, which displayed a mean decrease in performance at retest rather
than an improvement (1 s slower in part A and 7 s slower in part B), some tests could be relatively more influenced by aging-related changes than
by learning from previous exposure. In any case, the lack of statistical significance of the change in the TMT between visits
prevents us from drawing strong conclusions, and such an aging-related interpretation is merely speculative. Further research on the topic
could clarify whether reliable change differs across ages.
It has been suggested that age-related influences in practice effects operate mainly at the initial assessment and are driven by the
age-associated differences in learning ability at baseline (Salthouse, 2011). As displayed in our SRB calculations (Table 5), the
strongest predictor of score at retest is in fact the score at first visit, which is assumed to be influenced by age and education
factors. The relevance of initial performance in the prediction of performance at subsequent assessments has been previously reported
in studies using SRB indices (e.g., Attix et al., 2009; Duff, 2014) and supports the idea that the best predictor of future behavior
is past behavior. In our SRB models, age and education also significantly predicted retest scores in most of the variables
under study.
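The SRB calculation can be sketched with the Digit Span Forward equation listed in Table 5 (predicted T2 = 2.52 + T1 × 0.46 + edu × 0.05). Reading the value 0.86 in that row as the SEE is an assumption based on the column order, and the subject in the example is hypothetical; the function names are introduced here only for illustration.

```python
def predicted_dsf_t2(t1, education_years):
    """Predicted Digit Span Forward score at retest (Table 5 equation)."""
    return 2.52 + 0.46 * t1 + 0.05 * education_years


def srb_z(observed_t2, predicted_t2, see):
    """SRB index: (observed - predicted) / SEE, read as a z-score."""
    if see <= 0:
        raise ValueError("see must be positive")
    return (observed_t2 - predicted_t2) / see


# ASSUMPTION: 0.86 is the standard error of the estimate for this model.
DSF_SEE = 0.86

# Hypothetical subject: 15 years of education, span 6 at T1 and 4 at T2.
predicted = predicted_dsf_t2(t1=6, education_years=15)
z = srb_z(observed_t2=4, predicted_t2=predicted, see=DSF_SEE)
```

Because the baseline score enters the prediction directly, the same observed retest score yields a more negative z for a subject who started higher.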
With the aim of illustrating the application and usefulness of the indices presented here, an example is provided below.
Let us return to our hypothetical case of a 70-year-old man with 15 years of education who has memory complaints, and
imagine that he is administered the FCSRT twice with a test-retest interval of 1 year. His Total Recall raw score at baseline is
47 out of 48. This raw score corresponds to an age- and education-adjusted scaled score of 14, with an associated percentile
range of 90 to 94, according to published Spanish norms (see Peña-Casanova, Gramunt-Fombuena et al., 2009). At follow-up,
his score decreases to 43, which corresponds to a scaled score of 11 (associated percentile range 60–71), still above his
reference group mean and far from the usual MCI cut-off score of −1.5 SD (< percentile 7). A simple discrepancy score analysis
“subtle cognitive decline” to define Stage III preclinical AD should be made cautiously. Robust change norms obtained from
individuals who either do not develop AD pathology in further follow-ups or are negative for AD biomarkers at baseline will overcome
this limitation.
Finally, a comment on the generalizability of these data should be made. These norms are derived from a sample of
Spanish-speaking subjects from Spain. Therefore, they should be used cautiously if applied to a different population.
We recommend that clinicians from other Spanish-speaking countries, namely Mexico and the countries of South and
Central America, be aware of cultural and linguistic differences when making clinical interpretations derived from these data,
Funding
This study was mainly supported by a grant from the Pfizer Foundation, and by the Medical Department of Pfizer, SA, Spain. It
was also supported by the Behavioral Neurology group of the Program of Neuroscience, Hospital del Mar Research Institute,
Barcelona, Spain. JP-C has received an intensification research grant from the CIBERNED (Centro de Investigación
Biomédica en Red sobre Enfermedades Neurodegenerativas), Instituto Carlos III (Ministry of Health & Consumer Affairs of
Spain).
Conflict of Interest
None declared.
Acknowledgements
We would like to thank the participants in this study for their time and collaboration. The authors would also like to thank the
reviewers of this paper for their valuable comments.
Steering committee: JP-C, Hospital del Mar, Barcelona, Spain; RB, Hospital de la Santa Creu i Sant Pau, Barcelona, Spain; MA,
Hospital Mútua de Terrassa, Terrassa, Spain. Principal investigators: JP-C, Hospital del Mar, Barcelona, Spain; RB, Hospital de la
Santa Creu i Sant Pau, Barcelona, Spain; MA, Hospital Mútua de Terrassa, Terrassa, Spain; Jose Luis Molinuevo, Hospital Clínic,
Barcelona, Spain; AR, Hospital Clínico Universitario, Santiago de Compostela, Spain; MSB (deceased), Hospital Clínico San
Carlos, Madrid, Spain; CA, Hospital Virgen Arrixaca, Murcia, Spain; Carlos Martínez-Parra (deceased), Hospital Virgen
Macarena, Sevilla, Spain; AF-G, Hospital Universitario La Paz, Madrid, Spain; MF-M, Hospital de Cruces, Bilbao, Spain.
Genetics substudy: Rafael Oliva, Service of Genetics, Hospital Clínic, Barcelona, Spain. Neuroimaging substudy: Beatriz
Gómez-Ansón, Radiology Department and IDIBAPS, Hospital Clínic, Barcelona, Spain. Research Fellows: Gemma Monte,
Elena Alayrach, Aitor Sainz, and Claudia Caprile, Fundació Clínic, Hospital Clínic, Barcelona, Spain; Gonzalo
Sánchez-Benavides, Behavioral Neurology Group, Institut Municipal d’Investigació Mèdica, Barcelona, Spain. Clinicians,
psychologists, and neuropsychologists: Nina Gramunt (Coordinator), Peter Böhm, Sonia González, Yolanda Buriel, María Quintana,
Sonia Quiñones, Gonzalo Sánchez-Benavides, Rosa M. Manero, Gracia Cucurella, Institut Municipal d’Investigació Mèdica,
Barcelona, Spain; Eva Ruiz, Mónica Serradell, Laura Torner, Hospital Clínic, Barcelona, Spain; Dolors Badenes, Laura Casas,
Noemí Cerulla, Silvia Ramos, Loli Cabello, Hospital Mútua de Terrassa, Terrassa, Spain; Dolores Rodríguez, Clinical
Psychology and Psychobiology Department, University of Santiago de Compostela, Spain; María Payno, Clara Villanueva,
Hospital Clínico San Carlos, Madrid, Spain; Rafael Carles, Judit Jiménez, Martirio Antequera, Hospital Virgen Arrixaca,
Murcia, Spain; Jose Manuel Gata, Pablo Duque, Laura Jiménez, Hospital Virgen Macarena, Sevilla, Spain; Azucena Sanz,
María Dolores Aguilar, Hospital Universitario La Paz, Madrid, Spain; Ana Molano, Maitena Lasa, Hospital de Cruces, Bilbao,
Spain. Data management and biometrics: Josep Maria Sol, Francisco Hernández, Irune Quevedo, Anna Salvà, Verónica
Alfonso, European Biometrics Institute, Barcelona, Spain. Administrative management: Carme Pla (deceased), Romina Ribas,
Department of Psychiatry and Forensic Medicine, Universitat Autònoma de Barcelona, and Behavioral Neurology Group,
Institut Municipal d’Investigació Mèdica, Barcelona, Spain.
References
Albert, M. S., DeKosky, S. T., Dickson, D., Dubois, B., Feldman, H. H., Fox, N. C., et al. (2011). The diagnosis of mild cognitive impairment due to Alzheimer’s
disease: Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease.
Alzheimer’s and Dementia, 7 (3), 270–279.
Ardila, A. (2005). Cultural values underlying psychometric cognitive testing. Neuropsychology Review, 15 (4), 185–195.
Attix, D. K., Story, T. J., Chelune, G. J., Ball, J. D., Stutts, M. L., Hart, R. P., et al. (2009). The prediction of change: Normative neuropsychological trajectories. The
Sánchez-Benavides, G., Peña-Casanova, J., Casals-Coll, M., Gramunt, N., Molinuevo, J. L., Gómez-Ansón, B., et al. (2014). Cognitive and neuroimaging profiles
in mild cognitive impairment and Alzheimer’s disease: Data from the Spanish Multicenter Normative Studies (NEURONORMA Project). Journal of
Alzheimer’s Disease, 41 (3), 887–901.
Schoenberg, M. R., Rinehardt, E., Duff, K., Mattingly, M., Bharucha, K. J., & Scott, J. G. (2012). Assessing reliable change using the repeatable battery for the
assessment of neuropsychological status (RBANS) for patients with Parkinson’s Disease undergoing deep brain stimulation (DBS) surgery. The Clinical
Neuropsychologist, 26 (2), 255–270.
Sperling, R. A., Aisen, P. S., Beckett, L. A., Bennett, D. A., Craft, S., Fagan, A. M., et al. (2011). Toward defining the preclinical stages of Alzheimer’s disease:
Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimer’s