
The Clinical Neuropsychologist

ISSN: 1385-4046 (Print) 1744-4144 (Online) Journal homepage: http://www.tandfonline.com/loi/ntcn20

Assessing developmental delay in early childhood


— concerns with the Bayley-III scales

Peter J. Anderson & Alice Burnett

To cite this article: Peter J. Anderson & Alice Burnett (2016): Assessing developmental delay
in early childhood — concerns with the Bayley-III scales, The Clinical Neuropsychologist, DOI:
10.1080/13854046.2016.1216518

To link to this article: http://dx.doi.org/10.1080/13854046.2016.1216518

Published online: 12 Aug 2016.

Peter J. Anderson (a, b) and Alice Burnett (a, b, c)

(a) Clinical Sciences, Murdoch Childrens Research Institute, Melbourne, Australia; (b) Department of Paediatrics, The University of Melbourne, Melbourne, Australia; (c) Neonatal Medicine, Royal Children’s Hospital, Melbourne, Australia

ABSTRACT

Objective: Early detection of children with developmental delay is crucial for determining which children require close surveillance and intervention services. For many decades, the Bayley Scales has been the most widely used objective measure of early developmental delay, both in clinical and research settings. Significant structural changes were incorporated in the most recent edition, the Bayley Scales of Infant and Toddler Development, Third Edition (Bayley-III). This article reviews the psychometric properties of the Bayley-III and investigates criticisms raised on the Bayley-III, namely that it overestimates developmental status and is a poor predictor of later functioning. Method: This critical review examines the literature on the Bayley-III, which was released in 2006. Results: The Cognitive, Language, and Motor composites of the Bayley-III overestimate development, resulting in an under-identification of children with developmental delay. A range of strategies have been proposed for dealing with the inflated scores on the Bayley-III, none of which are ideal. Evidence to date suggests that the Bayley-III is a poor predictor of later cognitive and motor impairments. Conclusions: The Bayley-III needs new norms, or alternatively, it may be time for a new edition of the Bayley Scales.

ARTICLE HISTORY
Received 7 April 2016
Accepted 19 July 2016

KEYWORDS
Early childhood; infancy; developmental delay; Bayley Scales

Early detection of children with developmental delay is of utmost importance because inter-
vening early can prevent or reduce later cognitive, behavioral, educational, and social prob-
lems (Doyle, Harmon, Heckman, & Tremblay, 2009; Spittle, Orton, Anderson, Boyd, & Doyle,
2015). However, enormous variability is a feature of early cognitive, language, motor and
behavioral development, making it challenging for clinicians to diagnose developmental
delay. Screening questionnaires such as the Ages and Stages Questionnaire (ASQ-3)
(Squires & Bricker, 2009) are sometimes used, but standardized clinician-administered
instruments remain the gold-standard approach for assessing developmental status.
Developmental surveillance programs are now common in clinical and research settings to determine: (1) an individual’s eligibility for intervention services; (2) type of support required for individual children; (3) need for ongoing surveillance; (4) an individual’s progress while in an intervention program; (5) the outcome of clinical trials; and (6) the safety of pharmacological interventions. These surveillance programs tend to focus on high-risk infants, such as those with congenital abnormalities (e.g. neurofibromatosis, Prader–Willi syndrome, congenital diaphragmatic hernia), those who experienced neonatal complications (e.g. very preterm birth, hypoxic–ischemic encephalopathy, stroke), those exposed to neurotoxins (e.g. fetal alcohol syndrome, neonatal abstinence syndrome), and those from socially disadvantaged environments. Given the increased rates of developmental problems and long-term neurobehavioral problems in these clinical populations, close monitoring of their development is justified so that issues can be detected and managed early before they become severe and entrenched.

CONTACT  Peter J. Anderson  peter.anderson@mcri.edu.au
© 2016 Informa UK Limited, trading as Taylor & Francis Group

In this article, we review the Bayley Scales of Infant and Toddler Development, Third
Edition (Bayley-III), which is the most widely used standardized measure of early develop-
ment for both clinical and research purposes. We begin with a description of the Bayley-III,
followed by a critical narrative review of an emerging literature that suggests that the Bayley-
III overestimates development, and as such under-identifies developmental delay (Aylward,
2013). Strategies that have been proposed for dealing with inflated Bayley-III scores will be
presented, and the capacity of the Bayley-III to predict later cognitive and motor problems
will be discussed.

The Bayley Scales


The Bayley-III is the most widely used tool for assessment of early development. Currently
in its third edition (Bayley, 2006b), the Bayley Scales’ primary objective is ‘to identify children
with developmental delay and to provide information for intervention planning’ (Bayley,
2006a, p. 1), through individually administered assessment of children aged 1–42 months.
The test authors emphasize the difference between tests of development, such as the Bayley
Scales, and tests of general intelligence, noting that developmental tests do not assume
that a measure of ability at one time point will predict later ability, because of the rapid and
qualitative changes characteristic of early development (Bayley, 2006a).
The original Bayley Scales of Infant Development (BSID) was published almost 50 years
ago (Bayley, 1969). The purpose of the tool and the ‘eclectic’ nature of its theoretical under-
pinnings have remained constant, with no single theoretical orientation governing the test’s
content (Bayley, 1993, 2006b). However, the format and content have evolved over time.
Both the BSID and its second edition, the BSID-II (Bayley, 1993), comprised two direct assess-
ment indices: the Mental Development Index (MDI, tapping cognitive, language, and social
skills) and the Psychomotor Development Index (PDI, tapping fine and gross motor skills),
as well as a complementary measure assessing the child’s behavior during assessment (the
Infant Behavior Record or Behavior Rating Scale). The second edition revised the content
items of the original Bayley Scales (deleted awkward items and added over 100 new items),
introduced the administration of item sets based on the child’s age, updated the normative
data to reflect a contemporary sample of race/ethnicity, sex of the child, parental education,
and demographic location, expanded the age range from 2–30 months of age to 1–42 months
of age, and improved clinical utility by reporting data from clinical or at-risk groups of infants
and toddlers (Bayley, 1993). Facet scores were also derived from the MDI and PDI items to
provide a general developmental age for cognitive, language, and motor domains. While
the BSID and BSID-II were popular measures with established psychometric properties, they
lacked clinical specificity (Moore, Johnson, Haider, Hennessy, & Marlow, 2012). For example,
the MDI scale did not differentiate children with selective cognitive delay from those with
language delay, while the PDI did not differentiate children with selective fine motor delay
from those with gross motor delay.
The most substantial update in test structure came with the next revision and restand-
ardization: the Bayley Scales of Infant and Toddler Development, Third Edition, or Bayley-III
(Bayley, 2006b). This edition saw the creation of five distinct scales to better align with gov-
ernment guidelines regarding early childhood assessment, with the Bayley-III including
Cognitive (91 items), Language (97 items), and Motor scales (138 items), and caregiver ratings
of Social-Emotional (35 items) and Adaptive Behavior (241 items). A crucial change from the
BSID-II to Bayley-III was the separation of assessment of cognitive, expressive language (48
items), and receptive language (49 items) skills, as well as the separation of fine (66 items)
and gross (72 items) motor tasks, into subtests with explicit normative data. This structural
change is thought to enhance the clinical utility of the Bayley Scales, as a more detailed
assessment of strengths and weaknesses is possible, enabling more targeted interventions
to be prescribed. The Social-Emotional scale is based on the Greenspan Social-Emotional
Growth Chart (Greenspan, 2004) and is completed by the child’s caregiver to assess emotional
development and behavior. The Adaptive Behavior scale, also to be completed by caregivers,
is the second edition of the Adaptive Behavior Assessment System (ABAS-II) (Harrison &
Oakland, 2003). This scale estimates the child’s functioning in a wide range of adaptive skills.
The downsides of these structural changes are that the Bayley-III takes longer to administer
(up to 90 min) and that clinicians and researchers are less able to compare
Bayley-III scores with those from earlier editions. In most countries, the US norms are used
to interpret performance, although some local standardizations have been conducted (e.g.
in the Netherlands and Germany) (Steenis, Verhoeven, Hessen, & van Baar, 2015).

Psychometric properties of the Bayley-III


Unlike the earlier versions of the Bayley Scales, the third edition was normed using a mixed
sampling procedure. This involved a sample of typically developing children (n = 1700) rep-
resentative of the general US population on demographic characteristics. The normative
sample comprised 17 age groups of 100 children with age bands of 10-day intervals up to
5 months of age, 1-month intervals from 5 months to 36 months of age, and 3-month inter-
vals from 36 to 42 months of age. Added to the standardization sample (approximately 10%
of the overall sample) were children drawn from special group studies, including children
with Down syndrome, cerebral palsy, pervasive developmental disorder, premature birth,
language impairment, and children at-risk for developmental delay (Bayley, 2006a). The test
authors explain that this mixed sampling procedure was intended to ‘more accurately rep-
resent the population of infants and toddlers’ (Bayley, 2006a, p. 34). However, it has been
argued that a mixed sampling approach that includes children with delay/impairment lowers
the normative mean, increases the standard deviation, and decreases classification accuracy
by reducing the level of performance expected of typically developing children (Aylward,
2013; Pena, Spaulding, & Plante, 2006).
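
The arithmetic behind this argument can be sketched as follows. This is a minimal illustration, not the procedure used by the test publisher. It assumes that typically developing children are distributed with a mean of 100 and an SD of 15, and that a clinical subgroup making up 10% of the sample has a mean of 75 and an SD of 15. The clinical values and the function names are hypothetical and chosen only for illustration.

```python
import math


def mixture_mean_sd(p_clinical, typical=(100.0, 15.0), clinical=(75.0, 15.0)):
    """Mean and SD of a two-group mixture of score distributions.

    The clinical (mean, SD) default is a hypothetical value, not a figure
    taken from the Bayley-III manual.
    """
    m_t, s_t = typical
    m_c, s_c = clinical
    w_t = 1.0 - p_clinical
    mean = w_t * m_t + p_clinical * m_c
    # Law of total variance: within-group variance plus between-group spread.
    var = (w_t * (s_t ** 2 + (m_t - mean) ** 2)
           + p_clinical * (s_c ** 2 + (m_c - mean) ** 2))
    return mean, math.sqrt(var)


def rescale(raw, norm_mean, norm_sd):
    """Express a score as a composite (mean 100, SD 15) against given norms."""
    return 100.0 + 15.0 * (raw - norm_mean) / norm_sd


mean, sd = mixture_mean_sd(0.10)
# A child sitting exactly 1 SD below the typical-only mean (a score of 85)
# is re-scored against the mixed norms.
rescored = rescale(85.0, mean, sd)
print(round(mean, 1), round(sd, 2), round(rescored, 1))
```

Under these assumed values the mixed norms have a lower mean and a larger SD than the typical-only norms, so the same child receives a higher composite and no longer falls below a cut-off of 85.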
The Bayley-III technical manual reports good internal consistency, with high reliability
coefficients for the Cognitive (.91), Language (.93), and Motor (.92) scales (Bayley, 2006a).
Test–retest stability was examined by repeat assessment on a group of children with an
average interval between testing of 6 days (2–15 days). Average stability coefficients were
acceptable (Cognitive – .79, Language – .78, Motor – .81), but higher in older children. In
terms of validity, the Cognitive and Language scales correlated moderately to strongly with
the Wechsler Preschool and Primary Scale of Intelligence (WPPSI-III; .79–.82) for 57 children
aged 28–42 months, while the Motor scale correlated moderately with the Peabody
Developmental Motor Scales (PDMS-2; .49–.57). For 102 children, the BSID-II and Bayley-III
were administered in a counterbalanced order with a mean interval of 6 days. The BSID-II
MDI correlated more strongly with the Bayley-III Language scale (.71) than the Cognitive
scale (.60), while the BSID-II PDI and Bayley-III Motor composite correlated moderately (.60).
Importantly, the Bayley-III composite scores were approximately 7 points higher than the
respective BSID-II scales (Bayley, 2006a). This is contrary to expectations, as scores usually
decline when tests are revised and re-standardized, because population performance on
developmental/intelligence tests tends to creep upwards over time, the Flynn effect
(Aylward & Aylward, 2011; Flynn, 1999).
As development is a dynamic process with children maturing specific skills at different
rates, only moderate long-term stability of developmental status might be expected. To our
knowledge, this issue has been addressed only in clinical and mixed clinical/non-clinical
samples among English-speaking children. A recent report examined performance on the
Bayley-III at 8 and 20 months of age in a sample of 131 preterm children (Greene, Patra,
Silvestri, & Nelson, 2013). The mean Cognitive and Language composite scores were approx-
imately 6 points lower at the 20-month assessment than at the 8-month assessment, while
the Motor composite remained stable. The correlation coefficients for the Cognitive and
Motor composites at the two time points were moderately strong (.57 and .53, respectively),
while the association of the Language composite at the 8- and 20-month assessments was
only fair (.36). The rate of impairment (more than 2 SDs below the normative mean) was low at 8 and
20 months for all scales, although cognitive delay increased from 1.5 to 6.9% and language delay
increased from 3.6 to 21.4%, while the rate of motor delay remained stable. A more ambitious
study explored the stability of developmental delay at 7 time points (3, 4, 6, 9, 12, 18, and
24 months of age) in a mixed sample of preterm and term children (n = 54), in which delay
was classified as a scale score more than 1.5 SDs below the normative mean (Lobo, Paul,
Mackley, Maher, & Galloway, 2014). Children were categorized as exhibiting stable develop-
ment if they remained in the same classification (delayed or not) at all 7 time points, cate-
gorized as relatively stable if they changed classification once across the time points, or
categorized as unstable if they changed classification more than once. The proportion of
children with a stable developmental pattern was relatively low, ranging from 17% for recep-
tive language to 65% for fine motor. Only a small number of children were categorized as
exhibiting a relatively stable developmental pattern, but the proportion of children with
unstable profiles was considerable, ranging from 28% for expressive language to 67% for
receptive language.
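
The three-way classification used in that study can be expressed as a short sketch. The code below is an illustration of the rule as described above, not the authors' analysis code. It assumes that subtest scaled scores have a normative mean of 10 and an SD of 3, so that the 1.5 SD cut-off falls at 5.5; these defaults and the function names are assumptions made for the example.

```python
def is_delayed(score, norm_mean=10.0, norm_sd=3.0, cutoff_sd=1.5):
    """True if the score is more than cutoff_sd SDs below the normative mean."""
    return score < norm_mean - cutoff_sd * norm_sd


def classify_stability(scores, **norm_kwargs):
    """Label a trajectory by how often the delayed/not-delayed status changes."""
    flags = [is_delayed(s, **norm_kwargs) for s in scores]
    # Count transitions between consecutive assessments.
    changes = sum(1 for prev, curr in zip(flags, flags[1:]) if prev != curr)
    if changes == 0:
        return "stable"
    if changes == 1:
        return "relatively stable"
    return "unstable"


# Hypothetical scaled scores at the 7 assessment points (3 to 24 months).
print(classify_stability([9, 8, 10, 9, 11, 10, 9]))
print(classify_stability([5, 5, 8, 9, 9, 10, 10]))
print(classify_stability([9, 8, 5, 9, 10, 11, 9]))
```

A single low score in an otherwise typical trajectory produces two changes of classification, which is enough to label the child unstable under this rule.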

Under-identification of developmental delay


The Bayley-III was released in 2006, but it was not until 2010 that concerns about its
under-identification of developmental delay were first published (Anderson et al., 2010). This
Australian study of extremely preterm and full-term 2-year-olds reported that the scores on
the Bayley-III were elevated by more than .5 SD on the Cognitive and Language composite
scales, and more than 1 SD on the Motor composite, in healthy full-term controls relative to
the normative mean. While these higher-than-expected scores could reflect geographic
variability or sampling characteristics, anecdotal reports from clinicians using the Bayley-III
supported these findings. Since this initial publication, there have been numerous reports
supporting this concern with the Bayley-III. For example, some studies have reported marked
differences between BSID-II and Bayley-III scores (Moore et al., 2012; Silveira, Filipouski,
Goldstein, O’Shea, & Procianoy, 2012; Vohr et al., 2012), with a significantly lower rate of
developmental delay on the more recent edition.
The National Institute of Child Health and Human Development’s Neonatal Research
Network administered the BSID-II to 1012 extremely low birth-weight (ELBW) children aged
18–22 months born between 2006 and 2007 (period 1) and the Bayley-III to 1616 ELBW
children born between 2008 and 2011 (period 2) (Vohr et al., 2012). The mean Cognitive and
Language composite scores for the children in period 2 were 11 and 7 points higher, respec-
tively, than the mean MDI score for children in period 1, while the mean Motor score was 6
points higher in period 2. The rate of impairment (standard scores less than 70) was signifi-
cantly lower in period 2 (Cognitive – 10%, Language – 19%, Motor – 14%) than in period 1
(MDI – 37%, PDI – 27%). This difference in Bayley scores between periods 1 and 2 may be
explained by demographic differences or improved outcome between these two cohorts
rather than the test itself. However, other studies have found similar results when the BSID-II
and Bayley-III have been administered to the same sample (Moore et al., 2012; Silveira et al.,
2012). For example, children in the EPICure-2 cohort (born before 27 weeks of gestation in
England in 2006) were assessed on the same day using the Cognitive and Language items
from the Bayley-III and MDI items from the BSID-II at a median age of 33 months corrected
for prematurity (Moore et al., 2012). While the mean Bayley-III Cognitive scale was only 3
points higher than the mean BSID-II MDI score, the mean Bayley-III Language scale was 10
points higher than the mean MDI and the rate of impairment dropped from 25% with the
BSID-II to 14% with the Bayley-III. This study also found that the mean MDI score was less
correlated with an averaged Cognitive-Language score from the Bayley-III at lower scores
than at higher scores, a concerning finding given it is children scoring lower who most require
intervention. The limitations of the EPICure-2 study were that the BSID-II was not adminis-
tered according to standardized procedures, with the BSID-II-only items administered at the
end when attention and compliance are most challenged, and the Motor scales were not
administered. In contrast, a smaller Brazilian study administered the BSID-II and Bayley-III
on separate occasions (within 2 months) to 60 very preterm children, with mean Cognitive
and Language scales of the Bayley-III being 8–9 points higher than the MDI and the Motor
scale of the Bayley-III being 14 points higher than the PDI (Silveira et al., 2012). As the Bayley-
III was always administered after the BSID-II, it is possible that at least some of this difference
reflects learning effects given the marked overlap of items of the two editions. Similar con-
cerns with the Bayley-III have been reported in other clinical populations including infants
following neonatal encephalopathy and therapeutic hypothermia (Jary, Whitelaw, Walløe,
& Thoresen, 2013) and infants requiring complex cardiac surgery (Acton et al., 2011).
There are two possible explanations for Bayley-III scores being significantly higher than BSID-II
scores. Most authors have assumed that the Bayley-III overestimates development (i.e. scores are
inflated), but it is equally possible that the BSID-II under-estimates development (i.e. scores
are deflated). Supporting the former explanation, the reported rate of impairment on
BSID-II assessments has generally been in line with expectations based on the normal dis-
tribution and previous literature, while the reported rate of impairment on Bayley-III
assessments has been well below expectations (Anderson et al., 2010; Moore et al., 2012;
Silveira et al., 2012; Vohr et al., 2012). Even so, the studies contrasting the Bayley-III and BSID-II
have been predominantly with preterm and other high-risk populations, and the true rate
of developmental delay in these clinical disorders is not known. To confirm speculation that
the Bayley-III scores are inflated, evidence with representative samples of typically develop-
ing children is needed. In a large cohort (n = 202) of healthy full-term / normal birth-weight
children at 24 months of age with social demographic characteristics approximately repre-
sentative of the Australian community, the Bayley-III composite scores were well above the
normative mean (Cognitive – 109, Language – 108, Motor – 118) (Anderson et al., 2010). In
a representative cohort, it would be expected that the rate of mild-to-severe delay (com-
posite scores < 85) would be approximately 16–17%; however in this Australian cohort, the
rate of even mild delay was only 1, 4 and 2% for the Cognitive, Language and Motor com-
posites, respectively. Similarly, in a large Swedish cohort of term children randomly selected
from the Swedish Medical Registry (n = 366), the mean Cognitive (104), Language (109), and
Motor (107) composites were markedly higher than the normative mean. Furthermore, in a
large Dutch sample, the difference between test scores using local and US norms fluctuated
across domains and age (Steenis et al., 2015).
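
The expected rates quoted above follow from the normal distribution, as the sketch below shows. It assumes composite scores are normally distributed with an SD of 15. The shifted mean of 109 is the Cognitive mean reported for the Australian cohort, and keeping the SD at 15 for that cohort is a simplifying assumption rather than a reported value.

```python
import math


def normal_cdf(x, mean=100.0, sd=15.0):
    """Proportion of a normal distribution falling below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Expected proportion scoring below 85 if the published norms hold.
expected_mild = normal_cdf(85.0)
# Expected proportion scoring below 70 (moderate-to-severe delay).
expected_moderate = normal_cdf(70.0)
# Proportion below 85 if the true mean is 109 and the SD stays at 15.
shifted_mild = normal_cdf(85.0, mean=109.0)

print(f"{expected_mild:.3f} {expected_moderate:.3f} {shifted_mild:.3f}")
```

A shift in the mean of this size already cuts the proportion below 85 to roughly a third of the expected 16%. The observed rate of 1% is lower still, which would be consistent with a narrower spread of scores in that cohort, although that is an inference and not a reported finding.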
Based on this body of research, we believe there is sufficient evidence to argue that the
Bayley-III overestimates development, and as a consequence, under-reports developmental
delay. The degree to which the Bayley-III overestimates developmental status may vary across
different age bands; however, there is at least some evidence that it occurs in the first (Reuner,
Fields, Wittke, Löpprich, & Pietz, 2013), second (Acton et al., 2011; Anderson et al., 2010; Jary
et al., 2013; Silveira et al., 2012; Vohr et al., 2012), and third (Moore et al., 2012) years of life.
Further, the degree of overestimation may also vary across different levels of performance
(Moore et al., 2012). The mixed sampling procedure used for the standardization of the
Bayley-III is the most likely reason for the overestimation, as this approach tends to lower
group means, increase SD, and as a result, decrease the capacity to detect developmental
delay (Pena et al., 2006).

Strategies for dealing with inflated scores on the Bayley-III


Under-identification of developmental delay is a significant issue (Aylward, 2013). Clinically,
many families are being informed that their child is developing appropriately when they are
not, and some are missing out on intervention services that are clearly warranted. From a
research perspective, studies are under-reporting the frequency and severity of develop-
mental delay in clinical populations, resulting in misleading feedback to clinicians, who in
turn provide families with overly optimistic prognostic information. When used as the pri-
mary outcome in randomized controlled trials, the under-estimation of developmental delay
arising from the Bayley-III results in reduced statistical power as sample sizes have usually
been calculated using estimates from previous research using the BSID-II (Johnson, Moore,
& Marlow, 2014).
Given the wide use of the Bayley-III and the lack of alternative measures, a number of
potential solutions have been proposed to correct the inflated scores. In relation to classifying
developmental delay, cut-off points can be adjusted to be more reflective of community
estimates. Using the EPICure-2 sample of extremely preterm children, different cut-offs were
applied for predicting moderate-to-severe delay based on the BSID-II (Johnson et al., 2014),
with the best prediction for MDI < 70 being Bayley-III Cognitive and Language composite
scores < 85 or the average of Cognitive and Language composite scores < 80 in their
33-month-old sample (Johnson et al., 2014; Moore et al., 2012). While this practical approach
has merit for some research and clinical purposes, it has limitations: (1) it is based on the
outdated BSID-II, which may also under-identify delay; (2) it reverts to a composite score, which
has limited clinical utility; (3) it is not applicable for classifying mild delay, which also has
significant clinical implications; and (4) it is based on extremely preterm children within the
age range of 27–48 months, and needs replication with a large cohort of typically developing children.
Adjustment algorithms have also been proposed; these are a more direct way to correct for the inflated scores than raising the cut-offs for classifying delay. The EPICure-2 team generated an algorithm to convert Bayley-III Cognitive and Language composite scores to an equivalent BSID-II MDI (Moore et al., 2012):

$$\text{predicted MDI} = 88.8 - \left(61.6 \times \left(\frac{\text{Language composite}}{100}\right)^{-1}\right) + \left(0.67 \times \text{Cognitive composite}\right)$$

Other authors have also generated regression algorithms for converting Bayley-III scores to BSID-II for preterm and term children (Lowe, Erickson, Schrader, & Duncan, 2012) and survivors of neonatal encephalopathy (Jary et al., 2013). Such conversion algorithms are useful if the goal is to convert Bayley-III scores back to an outdated BSID-II score, but as noted previously, the clinical utility of the MDI has been criticized. In addition, the published algorithms differ for specific clinical populations, indicating that population-specific algorithms are likely required. Also, conversion algorithms to date have been derived from data in rather narrow age bands and it is likely that age-specific algorithms are also necessary.
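
As a worked illustration, the EPICure-2 conversion can be written as a small function. This sketch takes the Language term as an inverse, that is, (Language composite/100) raised to the power −1, which is the reading that yields plausible MDI values. The function name and the input check are additions for the example and are not part of the published algorithm.

```python
def predicted_mdi(language_composite, cognitive_composite):
    """Predicted BSID-II MDI from Bayley-III composites (Moore et al., 2012).

    Coefficients are those quoted in the text; the Language term is treated
    as (language_composite / 100) raised to the power -1.
    """
    if language_composite <= 0:
        raise ValueError("language_composite must be positive")
    return (88.8
            - 61.6 * (language_composite / 100.0) ** -1
            + 0.67 * cognitive_composite)


# Composites of 100 on both scales map to an MDI a little below 100.
print(round(predicted_mdi(100, 100), 1))
# Composites of 85 on both scales map to an MDI well below 85.
print(round(predicted_mdi(85, 85), 1))
```

Because the Language composite enters as a reciprocal, the predicted MDI falls faster as Language scores drop, so the correction is larger at the lower end of the distribution.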
Another approach is to use developmental age equivalents to generate a developmental
quotient (DQ) rather than the standardized composite/scale scores (Milne, McDonald, &
Comino, 2012). The formula for generating DQ is: (developmental age / actual age) × 100.
Milne et al. argue that this provides an estimate of the rate of development for individual
children relative to the standardization sample. In a sample of children referred for evaluation
for developmental delay/disability (n = 122, average age 35 months), DQ scores were signif-
icantly lower than standardized Cognitive, Language and Motor composite scores, with a
corresponding increase in the proportion of children classified as delayed, especially
moderately to profoundly delayed (DQ: 18.1%; Composite: 7.4%). A criticism of this approach is
that it is less precise, as it rests on the premise that SDs are equivalent at all ages, which
is not the case (Aylward, 2013).
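
The DQ formula given above is simple enough to state as code. The sketch below is a direct transcription of that formula; the function name, the choice of months as the unit, and the example ages are illustrative assumptions.

```python
def developmental_quotient(developmental_age_months, actual_age_months):
    """DQ = (developmental age / actual age) x 100, with both ages in months."""
    if actual_age_months <= 0:
        raise ValueError("actual_age_months must be positive")
    return 100.0 * developmental_age_months / actual_age_months


# Hypothetical child aged 35 months with a developmental age equivalent of 28 months.
print(developmental_quotient(28, 35))
```

The same DQ can correspond to different distances from the normative mean at different ages, which is the imprecision that the criticism above refers to.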
In some regions, large cohorts of typically developing children have been assessed on
the Bayley-III for screening purposes, or as participants in research projects. As long as the
cohort is approximately representative of the general population, the distribution charac-
teristics of this group could be used as a guide for interpreting test performance. For example,
some Australian clinicians and researchers are using published data of a Melbourne control
group, in which mean composite scores were between 8 and 18 points higher than the
normative mean of 100 (Anderson et al., 2010), to determine developmental status, and
when appropriate, the severity of delay. For those who can access these data, caution is
needed as these cohorts have usually been assessed within a narrow age band and may not
be translatable to children in younger and older age bands.
As the distribution of the normative sample appears to have shifted to the right (i.e.
upwards) by approximately 7 points, a simplistic approach may be to subtract 7 points from
the composite scores, or alternatively to adjust the classification cut-offs by 7 points. However,
the magnitude of the inflation differs across studies and across developmental domains
with the overestimation highest in the Motor and Language domains. Further, while the
mean inflation rate may be approximately 7 points, this is likely to vary across the distribution
of the composite scores. For example, the scores at the ends of the distribution may be more
inflated, and therefore require greater adjustment, than scores in the middle of the
distribution.
Ultimately, new or revised standardization data for the Bayley Scales are required. The
easiest solution would be to republish the norms excluding the 10% of children from high-
risk clinical populations, as this is the likely cause for the inflated scores, and is consistent
with the approach used for the BSID-II for which the norms were considered relatively accu-
rate. New standardization is unlikely given the Bayley-III was first published in 2006, although
a new edition is likely in the near future as tests are now generally revised and re-standardized
every 10–15 years. Clinicians and researchers can also consider alternative measures for
assessing early developmental status. While there are a few options available, none have
been reviewed or scrutinized as closely as the Bayley Scales, and they may have similar or
different measurement problems.

Predicting later functioning


The capacity of the Bayley Scales to predict later functioning has been examined, despite
the fact that this is not the intent of the test. Given that the Bayley Scales is commonly used in
follow-up programs for high-risk children, and as the primary outcome measure in observational
studies and randomized controlled trials, it was inevitable that the sensitivity and
specificity of the test in predicting later outcomes would be investigated. In interpreting
these findings, it is important to appreciate the considerable inter-individual variability
observed in early development, and the expectation that many children who are delayed
early in life will catch up to their peers. Accordingly, one should only expect the Bayley Scales
to be moderately predictive of later functioning. Also, while acknowledging inflated development
scores on the Bayley-III, it is important to recognize that school-age measures
of IQ and motor functioning also have measurement error, which may explain at least some of the
discordance between early and later assessments.
Prior to the Bayley-III, there had been reports that the Bayley Scales is a poor predictor of
later outcome, such as general intelligence (IQ). In a large cohort of extremely low birth-
weight children, Hack and colleagues reported that BSID-II MDI scores at 20 months of age
were significantly lower and rates of impairment significantly higher than at 8 years of age
when assessed on the Mental Processing Composite score of the Kaufman Assessment
Battery for Children (KABC) (Hack et al., 2005). For moderate-to-severe impairment (<70),
the MDI had a positive predictive value of .37 and negative predictive value of .98, while for
mild-to-severe impairment (<85), the MDI had a positive predictive value of .50 and negative
predictive value of .89. Children who had an MDI ≥ 85 were likely to score in the same range
at 8 years; however, there was much more variability for children who scored lower. For
example, approximately half the children who had an MDI < 70 at 20 months were scoring
within the normal range at 8 years of age. More recently, a meta-analysis was published
exploring the predictive value of the Bayley Scales in relation to later cognitive and motor
functioning in very preterm/very low birth-weight children (Luttikhuizen dos Santos, de
Kieviet, Königs, van Elburg, & Oosterlaan, 2013). The analyses were largely restricted to MDI
and PDI from the BSID-I and BSID-II, with only one eligible study that had used the Bayley-III.
Based on pooled data from 14 studies (n = 1330 children), the correlation between MDI and
later cognitive functioning was .61 (95% confidence interval (CI): .57–.64), explaining 37%
of the variance. The correlation between PDI and later motor functioning was lower (r = .34,
95%CI: .26–.42), and explained only 12% of the variance.
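
The predictive statistics reported in this section can be reproduced from first principles. The sketch below shows how positive and negative predictive values, sensitivity, and specificity are derived from a 2 × 2 table of early versus later classification, and how a correlation converts to variance explained. The cell counts are invented for illustration and are not taken from any of the cited studies. The two correlations are the pooled values quoted above.

```python
def predictive_values(tp, fp, fn, tn):
    """Diagnostic accuracy metrics from a 2 x 2 table.

    tp: delayed early and impaired later; fp: delayed early, not impaired later;
    fn: not delayed early, impaired later; tn: neither.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def variance_explained(r):
    """Proportion of variance explained by a correlation coefficient."""
    return r * r


# Hypothetical counts, for illustration only.
metrics = predictive_values(tp=20, fp=30, fn=5, tn=145)
print({name: round(value, 2) for name, value in metrics.items()})

# Pooled correlations quoted in the text.
print(round(100 * variance_explained(0.61)), round(100 * variance_explained(0.34)))
```

A low positive predictive value paired with a high negative predictive value, as reported for the MDI, means that a normal early score is reassuring while a low early score is often not confirmed at school age.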
Fewer studies have examined the sensitivity and specificity of the Bayley-III in relation to
later functioning. Bode and colleagues administered the Cognitive and Language scales of
the Bayley-III at 2 years of age and the WPPSI-III at 4 years of age in children born very preterm
as well as matched controls (Bode, D’Eugenio, Mettelman, & Gross, 2014). The Bayley-III scales
correlated highly with Full-Scale IQ at 4 years (Cognitive: r = .81; Language: r = .78), although this
varied according to gestational age at birth, with the correlation being highest for those
born earlier. High sensitivity and specificity were reported, indicating that the Bayley-III is
predictive of preschool IQ; however, outcome classifications were determined according to
the distribution of the control group (M, SD) rather than the test norms of the Bayley-III and
WPPSI-III (Bode et al., 2014). To truly evaluate the predictive validity of the Bayley-III, delay/
impairment needs to be judged according to the test norms as this is how the test is designed
to be used.
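To illustrate why the reference distribution matters, the sketch below classifies the same score twice: once against test norms and once against a control-group distribution. It assumes the conventional composite metric (mean 100, SD 15), which matches the <85 and <70 cut-offs used earlier. The control-group mean and SD, the example score, and the function name are hypothetical values chosen for illustration, not data from Bode et al. (2014).

```python
def classify(score: float, mean: float, sd: float) -> str:
    """Classify a composite score relative to a reference distribution.

    Below -2 SD: moderate-to-severe delay; below -1 SD: mild delay.
    """
    if score < mean - 2 * sd:
        return "moderate-to-severe"
    if score < mean - sd:
        return "mild"
    return "none"


# Test norms (assumed conventional composite metric).
NORM_MEAN, NORM_SD = 100.0, 15.0
# HYPOTHETICAL control-group distribution, for illustration only.
CONTROL_MEAN, CONTROL_SD = 105.0, 12.0

score = 88.0  # hypothetical child
by_norms = classify(score, NORM_MEAN, NORM_SD)           # cut-offs: 85 and 70
by_controls = classify(score, CONTROL_MEAN, CONTROL_SD)  # cut-offs: 93 and 81

print(f"Test norms: {by_norms}; control-group reference: {by_controls}")
```

Under these assumed values the same child is unimpaired by the test norms but mildly delayed relative to controls, so sensitivity and specificity computed with one reference cannot be assumed to hold for the other.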
Using a cohort of very preterm infants (n = 105), the Victorian Infant Brain Studies team
has published a series of papers examining the capacity of the Bayley-III to predict later cognitive
and motor functioning using the test norms to classify impaired/delayed performance
(Spencer-Smith, Spittle, Lee, Doyle, & Anderson, 2015; Spittle et al., 2013). The Bayley-III was
administered at 2 years of age while the Movement Assessment Battery for Children (MABC-2)
(Henderson, Sugden, & Barnett, 2007) and the Differential Ability Scales (DAS-II) (Elliott, 2007)
were administered at 4 years of age to assess motor functioning and general intelligence.
While there was a strong correlation between the Bayley-III Motor composite and the MABC-2
percentile rank, the rate of impaired motor performance was significantly higher at 4 years
of age (Spittle et al., 2013). Based on the Bayley-III, 9% of the cohort were classified as mildly
to severely motor delayed and 4% as moderately to severely delayed, while on the MABC-2,
the rate of mild-to-severe impairment was 22%, with 19% exhibiting moderate-to-severe
motor impairment. The sensitivity of the Bayley-III to predict later motor impairment was
very low, but specificity was excellent. In other words, the children whom the Bayley-III
identified as having motor delay at 2 years were very likely to have motor impairment at
4 years, but most of the children with a later motor impairment were not delayed on the
Bayley-III. Likewise, the Bayley-III Cognitive and Language composites correlated highly with
the General Conceptual Ability (GCA) scale of the DAS-II and achieved high specificity but
low levels of sensitivity for predicting later cognitive impairment (Spencer-Smith et al., 2015).
Thus, the Bayley-III tends to under-identify later cognitive and motor impairment.
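The pattern described above (low sensitivity alongside high specificity) can be made concrete with a two-by-two table of early classification against later outcome. The sketch below uses HYPOTHETICAL counts chosen only to reproduce that pattern; they are not the counts reported by Spittle et al. (2013) or Spencer-Smith et al. (2015), and the function name is ours.

```python
def diagnostic_accuracy(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard accuracy indices from a 2x2 table.

    tp: delayed on the early test and impaired later
    fp: delayed on the early test but not impaired later
    fn: not delayed on the early test but impaired later
    tn: not delayed on the early test and not impaired later
    """
    return {
        "sensitivity": tp / (tp + fn),  # impaired children the early test caught
        "specificity": tn / (tn + fp),  # unimpaired children correctly cleared
        "ppv": tp / (tp + fp),          # early "delayed" calls that proved correct
        "npv": tn / (tn + fn),          # early "not delayed" calls that proved correct
    }


# HYPOTHETICAL counts for 200 children, for illustration only.
result = diagnostic_accuracy(tp=8, fp=2, fn=32, tn=158)

for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these assumed counts, most children flagged at the early assessment are impaired later (high positive predictive value), yet four in five of the children with later impairment were not flagged, which is what under-identification means in practice.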

Conclusion
The Bayley-III is the most widely used measure for assessing early developmental status,
specifically cognitive, language and motor delay. However, there is now considerable evidence
demonstrating that the Bayley-III overestimates development, resulting in a misclassification
of developmental delay. Concerningly, a significant proportion of children who
are performing age appropriately on the Bayley-III are actually delayed, and some children
may not be receiving important services, including early intervention, because they are
being misclassified by the Bayley-III. In terms of predicting later functioning, evidence to
date suggests that the Bayley-III tends to under-identify later cognitive and motor impairments.
While this is to be expected given the dynamic nature of development, we expect
the sensitivity of the Bayley-III to be poorer than that of earlier versions of the Bayley Scales. A
number of strategies have been proposed for dealing with the inflated scores, and while
each approach has merit for specific applications, none is ideal. We believe the best solutions
are to (1) publish new norms excluding the high-risk children added to the initial
standardization sample, (2) re-standardize the Bayley-III, or (3) devise a new edition of the
Bayley Scales with improved norms. In the meantime, in order to assist with interpreting
Bayley-III test scores, clinicians and researchers are encouraged to utilize Bayley-III data collected
on cohorts of typically developing children representative of their region, if available.
We do not recommend reverting to the BSID-II, or using algorithms to convert Bayley-III
scores to BSID-II scores, except when it is necessary to compare with earlier studies.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by Australia’s National Health and Medical Research Council [grant number
1081288].

References
Acton, B. V., Biggs, W. S., Creighton, D. E., Penner, K. A., Switzer, H. N., Thomas, J. H., … Robertson, C. M.
(2011). Overestimating neurodevelopment using the Bayley-III after early complex cardiac surgery.
Pediatrics, 128, e794–e800.
Anderson, P. J., De Luca, C. R., Hutchinson, E., Roberts, G., Doyle, L. W., & Victorian Infant Collaborative
Group. (2010). Underestimation of developmental delay by the new Bayley-III Scale. Archives of Pediatrics
& Adolescent Medicine, 164, 352–356.
Aylward, G. P. (2013). Continuing issues with the Bayley-III. Journal of Developmental and Behavioral
Pediatrics, 34, 697–701.
Aylward, G. P., & Aylward, B. S. (2011). The changing yardstick in measurement of cognitive abilities in
infancy. Journal of Developmental and Behavioral Pediatrics, 32, 465–468.
Bayley, N. (1969). Manual for the Bayley Scales of infant development. San Antonio, TX: The Psychological
Corporation.
Bayley, N. (1993). Bayley Scales of infant development, second edition: Manual. San Antonio, TX: The
Psychological Corporation.
Bayley, N. (2006a). Bayley Scales of infant and toddler development, third edition technical manual.
San Antonio, TX: Pearson PsychCorp.
Bayley, N. (2006b). Bayley Scales of infant and toddler development (3rd ed.). San Antonio, TX: Pearson
PsychCorp.
Bode, M. M., D’Eugenio, D. B., Mettelman, B. B., & Gross, S. J. (2014). Predictive validity of the Bayley, Third
Edition at 2 years for intelligence quotient at 4 years in preterm infants. Journal of Developmental
and Behavioral Pediatrics, 35, 570–575.
Doyle, O., Harmon, C. P., Heckman, J. J., & Tremblay, R. E. (2009). Investing in early human development:
Timing and economic efficiency. Economics & Human Biology, 7(1), 1–6.
Elliott, C. (2007). Differential ability scales-II (DAS-II). San Antonio, TX: Harcourt Assessment.
Flynn, J. R. (1999). Searching for justice – The discovery of IQ gains over time. American Psychologist,
54, 5–20.
Greene, M. M., Patra, K., Silvestri, J. M., & Nelson, M. N. (2013). Re-evaluating preterm infants with the
Bayley-III: Patterns and predictors of change. Research in Developmental Disabilities, 34, 2107–2117.
Greenspan, S. I. (2004). Greenspan social–emotional growth chart: A screening questionnaire for infants
and young children. San Antonio, TX: Harcourt Assessment.
Hack, M., Taylor, H. G., Drotar, D., Schluchter, M., Cartar, L., Wilson-Costello, D., & Morrow, M. (2005). Poor
predictive validity of the Bayley Scales of infant development for cognitive function of extremely
low birth weight children at school age. Pediatrics, 116, 333–341.
Harrison, P. L., & Oakland, T. (2003). Adaptive behavior assessment system–Second Edition. San Antonio,
TX: The Psychological Corporation.
Henderson, S., Sugden, D., & Barnett, A. (2007). The movement assessment battery for children (2nd ed.).
London: The Psychological Corporation.
Jary, S., Whitelaw, A., Walløe, L., & Thoresen, M. (2013). Comparison of Bayley-2 and Bayley-3 scores
at 18 months in term infants following neonatal encephalopathy and therapeutic hypothermia.
Developmental Medicine & Child Neurology, 55, 1053–1059.
Johnson, S., Moore, T., & Marlow, N. (2014). Using the Bayley-III to assess neurodevelopmental delay:
Which cut-off should be used? Pediatric Research, 75, 670–674.
Lobo, M. A., Paul, D. A., Mackley, A., Maher, J., & Galloway, J. C. (2014). Instability of delay classification and
determination of early intervention eligibility in the first two years of life. Research in Developmental
Disabilities, 35, 117–126.
Lowe, J. R., Erickson, S. J., Schrader, R., & Duncan, A. F. (2012). Comparison of the Bayley II mental
developmental index and the Bayley III cognitive scale: Are we measuring the same thing? Acta
Paediatrica, 101, e55–e58.
Luttikhuizen dos Santos, E. S., de Kieviet, J. F., Königs, M., van Elburg, R. M., & Oosterlaan, J. (2013).
Predictive value of the Bayley Scales of infant development on development of very preterm/very
low birth weight children: A meta-analysis. Early Human Development, 89, 487–496.
Milne, S., McDonald, J., & Comino, E. J. (2012). The use of the Bayley Scales of infant and toddler
development III with clinical populations: A preliminary exploration. Physical & Occupational Therapy
in Pediatrics, 32, 24–33.
Moore, T., Johnson, S., Haider, S., Hennessy, E., & Marlow, N. (2012). Relationship between test scores
using the second and third editions of the Bayley Scales in extremely preterm children. The Journal
of Pediatrics, 160, 553–558.
Pena, E. D., Spaulding, T. J., & Plante, E. (2006). The composition of normative groups and diagnostic
decision making: Shooting ourselves in the foot. American Journal of Speech-Language Pathology,
15, 247–254.
Reuner, G., Fields, A. C., Wittke, A., Löpprich, M., & Pietz, J. (2013). Comparison of the developmental
tests Bayley-III and Bayley-II in 7-month-old infants born preterm. European Journal of Pediatrics,
172, 393–400.
Silveira, R. C., Filipouski, G. R., Goldstein, D. J., O’Shea, T. M., & Procianoy, R. S. (2012). Agreement between
Bayley Scales second and third edition assessments of very low-birth-weight infants. Archives of
Pediatrics & Adolescent Medicine, 166, 1075–1076.
Spencer-Smith, M. M., Spittle, A. J., Lee, K. J., Doyle, L. W., & Anderson, P. J. (2015). Bayley-III cognitive
and language scales in preterm children. Pediatrics, 135, e1258–e1265.
Spittle, A., Orton, J., Anderson, P. J., Boyd, R., & Doyle, L. W. (2015). Early developmental intervention
programmes provided post hospital discharge to prevent motor and cognitive impairment in preterm
infants. Cochrane Database of Systematic Reviews, 11, Art. No. CD005495.
doi:10.1002/14651858.CD005495.pub4
Spittle, A. J., Spencer-Smith, M. M., Eeles, A. L., Lee, K. J., Lorefice, L. E., Anderson, P. J., & Doyle, L. W.
(2013). Does the Bayley-III Motor Scale at 2 years predict motor outcome at 4 years in very preterm
children? Developmental Medicine and Child Neurology, 55, 448–452.
Squires, J., & Bricker, D. (2009). Ages & stages questionnaires, third edition (ASQ-3): A parent-completed
child-monitoring system. Baltimore, MD: Paul Brookes.
Steenis, L. J. P., Verhoeven, M., Hessen, D. J., & van Baar, A. L. (2015). Performance of Dutch children on
the Bayley III: A comparison study of US and Dutch norms. PLoS ONE, 10, e0132871.
doi:10.1371/journal.pone.0132871
Vohr, B. R., Stephens, B. E., Higgins, R. D., Bann, C. M., Hintz, S. R., Das, A., … Fuller, J. (2012). Are outcomes
of extremely preterm infants improving? Impact of Bayley assessment on outcomes. The Journal of
Pediatrics, 161, 222–228.
