
Article

A Motor Speech Assessment for Children With Severe Speech Disorders: Reliability and Validity Evidence

Edythe A. Strand,a Rebecca J. McCauley,b Stephen D. Weigand,a Ruth E. Stoeckel,a and Becky S. Baasa

Purpose: In this article, the authors report reliability and validity evidence for the Dynamic Evaluation of Motor Speech Skill (DEMSS), a new test that uses dynamic assessment to aid in the differential diagnosis of childhood apraxia of speech (CAS).

Method: Participants were 81 children between 36 and 79 months of age who were referred to the Mayo Clinic for diagnosis of speech sound disorders. Children were given the DEMSS and a standard speech and language test battery as part of routine evaluations. Subsequently, intrajudge, interjudge, and test–retest reliability were evaluated for a subset of participants. Construct validity was explored for all 81 participants through the use of agglomerative cluster analysis, sensitivity measures, and likelihood ratios.

Results: The mean percentage of agreement for 171 judgments was 89% for test–retest reliability, 89% for intrajudge reliability, and 91% for interjudge reliability. Agglomerative hierarchical cluster analysis showed that total DEMSS scores largely differentiated clusters of children with CAS vs. mild CAS vs. other speech disorders. Positive and negative likelihood ratios and measures of sensitivity and specificity suggested that the DEMSS does not overdiagnose CAS but sometimes fails to identify children with CAS.

Conclusions: The value of the DEMSS in differential diagnosis of severe speech impairments was supported on the basis of evidence of reliability and validity.

Key Words: childhood apraxia of speech, dynamic assessment, test development, differential diagnosis, pediatric motor speech disorders, speech sound disorders

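The sensitivity, specificity, and likelihood-ratio pattern reported above ("does not overdiagnose CAS but sometimes fails to identify children with CAS") is characteristic of a test with high specificity and more modest sensitivity. As a minimal sketch of how these measures are computed from a 2 × 2 classification table (the counts below are invented for illustration and are not the study's data):

```python
# Hypothetical 2 x 2 classification counts for a diagnostic cut point
# (invented numbers; not the DEMSS study's results).
true_pos, false_neg = 15, 5   # children with CAS: identified / missed
true_neg, false_pos = 58, 3   # children without CAS: cleared / flagged

sensitivity = true_pos / (true_pos + false_neg)   # proportion of CAS cases found
specificity = true_neg / (true_neg + false_pos)   # proportion of non-CAS cleared

# Likelihood ratios are less affected by the sample's base rate:
lr_positive = sensitivity / (1 - specificity)     # evidence carried by a positive result
lr_negative = (1 - sensitivity) / specificity     # evidence carried by a negative result

print(f"sensitivity = {sensitivity:.2f}")   # modest: some CAS cases are missed
print(f"specificity = {specificity:.2f}")   # high: CAS is rarely overdiagnosed
print(f"LR+ = {lr_positive:.2f}, LR- = {lr_negative:.2f}")
```

With counts like these, LR+ is large (a positive result strongly raises the odds of CAS) while LR− is only moderately small, mirroring the asymmetry the abstract describes.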
a Mayo Clinic, Rochester, Minnesota
b The Ohio State University, Columbus

Correspondence to Edythe A. Strand: strand.edythe@mayo.edu
Editor: Jody Kreiman
Associate Editor: Julie Liss
Received March 24, 2012
Revision received July 9, 2012
Accepted August 18, 2012
DOI: 10.1044/1092-4388(2012/12-0094)

Journal of Speech, Language, and Hearing Research • Vol. 56 • 505–520 • April 2013 • © American Speech-Language-Hearing Association

Pediatric speech sound disorders (SSDs) result from a variety of etiologies and represent impairment at a number of different levels of speech production, including linguistic–phonologic and/or motor speech. One of the many challenges in differential diagnosis of SSD is determining to what degree motor speech impairment (i.e., deficits in planning–programming movement sequences and/or executing the movement) contributes to the child's SSD. In this article, we report evidence regarding the reliability and validity of the Dynamic Evaluation of Motor Speech Skill (DEMSS), a new instrument intended to address this challenge. The DEMSS is intended to assist in differential diagnosis of SSD in younger children and/or those with more severe impairments, including children who have little or no functional verbal communication but can at least attempt imitation.

The DEMSS is specifically designed to identify those children who have difficulty with praxis for speech. This level of motoric difficulty is now most commonly designated by the label childhood apraxia of speech (CAS; e.g., American Speech-Language-Hearing Association [ASHA], 2007; Terband & Maassen, 2010). Isolating deficits in the ability to plan and program movement transitions between articulatory postures for volitional speech in children with SSD is difficult, in part because speech and language processes are interactive (Goffman, 2004; Kent, 2004; Strand, 1992). Because these motor deficits often occur along with deficits in phonology, symptomatology may reflect a combination of linguistic (phonologic) and motor speech deficits (CAS and/or dysarthria; Crary, 1993; Rvachew, Hodge, & Ohberg, 2005; Smith & Goffman, 2004). In the absence of a physiologic marker for CAS (Shriberg, 2003; Shriberg, Aram, & Kwiatkowski, 1997), clinicians are left to make this diagnosis on the basis of behavioral characteristics.

A number of possible discriminative behavioral characteristics for CAS have been proposed (e.g., Davis, Jakielski, & Marquardt, 1998; Ozanne, 1995; Shriberg et al., 2003). In an effort to encourage greater uniformity in methods used for diagnosis, treatment, and study of CAS, ASHA (2007) undertook an examination of this area of
practice. On the basis of an extensive review of the literature and expert testimony, authors of the resulting position statement came to the following conclusion:

[T]hree segmental and suprasegmental features that are consistent with a deficit in the planning and programming of movements for speech have gained some consensus among investigators in apraxia of speech in children: (a) inconsistent errors on consonants and vowels in repeated productions of syllables or words, (b) lengthened and disrupted coarticulatory transitions between sounds and syllables; and (c) inappropriate prosody, especially in the realization of lexical or phrasal stress. (p. 1)

Although these characteristics are broadly endorsed, the position statement contains the acknowledgment that no list of specific characteristics has yet been empirically validated for differential diagnosis.

Although ASHA's (2007) position statement and other frequently cited publications (e.g., Davis et al., 1998) posit behavioral characteristics as discriminative for the diagnosis of CAS, they do not directly lead to assessment procedures. Typical strategies used in the clinical assessment of SSDs include taking a thorough history, describing phonetic and phonemic inventories, administering oral structural–functional examinations, as well as giving articulation tests and/or measures of phonologic performance. Tests of articulation and phonology are designed for children who have at least a rudimentary speech sound inventory. Yet, there is a need for a tool to examine the performance of children who are very young and/or very limited in their speech production skills, including those who are nonverbal.

Motor Speech Examination

One type of assessment tool that may be adapted for use with these younger children is a motor speech examination (MSE), which is frequently used with adults to determine the presence of, or to rule out difficulty with, speech motor planning and programming (Duffy, 2005; McNeil, Robin, & Schmidt, 2009; Yorkston, Beukelman, Strand, & Hackel, 2010). It allows the clinician to observe speech production across utterances that vary in length and phonetic complexity using hierarchically organized stimuli—conditions that systematically vary programming demands. It also allows the clinician to make observations of behaviors frequently associated with deficits in speech praxis, including consonant and vowel distortions due to lengthened and/or distorted movement transitions, timing errors, dysprosody, and inconsistency across repeated trials. However, MSEs are less frequently used in assessing child SSDs (Skahan, Watson, & Lof, 2007), and, when used, they often lack evidence of psychometric quality (e.g., Guyette, 2001; McCauley, 2003; McCauley & Strand, 2008). Specifically, McCauley and Strand (2008) found that of six published tests designed to assist in the diagnosis of motor speech disorders, only one—the Verbal Motor Production Assessment for Children (Hayden & Square, 1999)—provided evidence of validity, and none provided adequate evidence of reliability. Thus, this report is designed to describe our efforts in developing an MSE supported by more extensive evidence of reliability and validity.

Reliability and validity. Reliability evidence is fundamental to a test's development because measures of reliability estimate the degree to which a test is vulnerable to various sources of error (e.g., variation due to readministrations, testers, or within-tester inconsistencies), which constitute threats to the test's validity. This is especially true for an MSE for children because of the high potential for error presented by the nature of young children and their response to the artificialities of speech testing (Kent, Kent, & Rosenbek, 1987).

Validity refers to the degree to which a test measures what it purports to measure. Validity evidence for a test's use in diagnosis—that is, evidence supporting its contribution to accurate diagnosis—can be provided using several different approaches (Downing & Haladyna, 2006). Contrasting groups and correlations between the test under study and a gold standard (i.e., acknowledged valid measure) are particularly common methods (McCauley, 2001). Cluster analysis—a statistical method that divides data into meaningful groups sharing common characteristics—may provide an equally compelling method for exploration of a test's validity for use in diagnosis. In studies of communication disorders, researchers have used cluster analysis primarily to (a) probe constructs such as the existence of homogeneous subgroups within broader diagnoses, including autism spectrum disorders (e.g., Verté et al., 2006), specific language disorders (e.g., Conti-Ramsden & Botting, 1999), and SSDs (Arndt, Shelton, Johnson, & Furr, 1977; Johnson, Shelton, & Arndt, 1982; Peter & Stoel-Gammon, 2008) and (b) identify clusters of co-occurring speech and nonspeech characteristics in CAS (Ball, Bernthal, & Beukelman, 2002).

When used in test development, cluster analysis allows one to begin with a heterogeneous group of children (e.g., children with SSDs) and determine to what extent the test and its components (e.g., subscores) can identify meaningful subgroups—for example, ones that might be related to motor speech status or other unanticipated participant variables. When test-determined clusters match other bases for participant groupings (e.g., diagnosis using different methods), it provides valuable evidence of the test's construct (diagnostic) validity. This report on the DEMSS makes use of this type of evidence.

The Dynamic Evaluation of Motor Speech Skill

As a motor speech examination, the DEMSS systematically varies the length, vowel content, prosodic content, and phonetic complexity within sampled utterances. Thus, it is not intended to serve as yet another articulation test or test of phonologic proficiency, both of which typically sample all segments in a language. Rather, the DEMSS is designed specifically to examine the speech movements of younger children and/or children who are more severely impaired, even those who may not yet produce many sounds, syllables, or words. Therefore, instead of sampling all American English speech sounds, the DEMSS focuses on earlier

developing consonant sounds paired with a variety of vowels in several earlier developing syllable shapes. The stimuli and the scoring system are designed to allow examination of characteristics that have been frequently posited or reported in the literature to be associated with CAS (e.g., ASHA, 2007; Campbell, 2003; Davis et al., 1998; Strand, 2003), including lengthened and disrupted coarticulatory transitions between sounds and syllables (articulatory accuracy), inconsistency of errors on consonants and vowels across repeated trials, and prosodic accuracy such as lexical stress.

The DEMSS comprises nine subtests (see Table 1), consisting of 66 utterances. These 66 utterances are associated with 171 items (i.e., judgments) that contribute to four types of subscores (overall articulatory accuracy of the word, vowel accuracy, prosodic accuracy, and consistency). Judgments of overall articulatory accuracy are made for all 66 utterances, judgments of vowel accuracy are made for 56 of these utterances, judgments of prosodic accuracy (viz., lexical stress accuracy) are made for 21 of the utterances, and judgments of consistency are made for 28 of the utterances. The DEMSS total score is the sum of the four subscores (overall articulatory accuracy, vowel accuracy, prosodic accuracy, and consistency). Items are scored either during testing or from a videotape sample following administration of the test.

The DEMSS uses dynamic assessment (Bain, 1994; Glaspey & Stoel-Gammon, 2007; Lidz & Peña, 1996), in which multiple attempts are elicited for scoring as the clinician uses cues and other strategies (e.g., slowed rate or simultaneous production) designed to facilitate performance. Dynamic assessment may be especially important in assessing SSDs. For some children who have few sounds, syllables, or words, even modest support may allow them to produce the utterance more accurately, showing emerging skills. Further, when the child is attempting to imitate specific speech movements with cuing, he or she may increase attention and/or effort toward achieving a particular spatial or temporal target. This allows observation of groping, segmentation, timing errors, or other characteristics associated with CAS that are frequently not evident in spontaneous utterances or in noncued repetitions.

The cuing used in dynamic assessment has the potential to facilitate judgments of severity and therefore prognosis, as well as to facilitate treatment planning. For example, if a child consistently needs considerable cuing to correctly produce a target or never produces it correctly despite cuing, his or her problem is seen as more severe and the prognosis for rapid improvement as more guarded (Peña et al., 2006). Treatment planning is facilitated in that the types of cues that proved helpful during the administration of the test suggest cuing strategies that are likely to be useful in treatment. Further, reviewing errors on specific vowels and across particular syllable shapes facilitates choices of content and complexity of early stimulus sets. Because the entire DEMSS uses dynamic assessment, judgments of severity and prognosis are facilitated beyond tools currently available.

During administration of the DEMSS, the clinician asks the child to imitate a series of words, with the child's eyes directed to the clinician's face as much as possible. Depending on the child's initial imitation of an utterance, the clinician may elicit one or more additional imitative attempts, with various levels of cuing, before scoring is completed. Visual, temporal, and tactile cues are implemented to help the child improve accuracy of production over repeated trials, with multidimensional scoring reflecting his or her responsiveness to cuing. (Please refer to the Appendix for examples of the multidimensional scoring.)

Table 2 describes basic rules for assigning specific scores within the four subscores—overall articulatory accuracy, vowel accuracy, prosody (lexical stress accuracy), and consistency (with higher scores reflecting poorer performance). Only vowel accuracy and prosody are always scored on the first attempt at an utterance. Overall articulatory accuracy may be scored based on subsequent trials. Consistency is always scored after all attempts on an

Table 1. Dynamic Evaluation of Motor Speech Skills (DEMSS) content coverage.

                                           No. of items judged for each subscore (range of possible scores)
Utterance type (examples)                  No. of       Overall articulatory   Vowel       Prosodic    Consistency
                                           utterances   accuracy               accuracy    accuracy
CV (me, hi)                                 8            8 (0–32)               8 (0–16)                4 (0–4)
VC (up, eat)                                8            8 (0–32)               8 (0–16)                4 (0–4)
Reduplicated syllables (mama, booboo)       4            4 (0–16)                           4 (0–4)
CVC1 (mom, peep, pop)                       6            6 (0–24)               6 (0–12)                6 (0–6)
CVC2 (mad, bed, hop)                        8            8 (0–32)               8 (0–16)                8 (0–8)
Bisyllabic 1 (baby, puppy)                  5            5 (0–20)               5 (0–10)    5 (0–5)
Bisyllabic 2 (bunny, happy)                 6            6 (0–24)                           6 (0–6)
Multisyllabic (banana, kangaroo)            6            6 (0–24)               6 (0–12)    6 (0–6)     6 (0–6)
Utterances of increasing length             15           15 (0–60)              15 (0–30)
  (dad, hi dad, hi daddy)
Total utterances                            66           66                     56          21          28

Note. Only items within specific utterance types contribute to subscores for consistency, vowel accuracy, or prosodic accuracy.
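Assuming the column placements in Table 1 are read correctly, the per-subtest item counts reproduce the published subscore totals (66, 56, 21, and 28 judgments; 171 in all) and, combined with the per-item maxima from the scoring scales in Table 2 (4, 2, 1, and 1 points), give the maximum possible total score. A small cross-check:

```python
# Items judged per subtest, as (articulatory, vowel, prosody, consistency),
# transcribed from Table 1.
items = {
    "CV":                (8, 8, 0, 4),
    "VC":                (8, 8, 0, 4),
    "Reduplicated":      (4, 0, 4, 0),
    "CVC1":              (6, 6, 0, 6),
    "CVC2":              (8, 8, 0, 8),
    "Bisyllabic 1":      (5, 5, 5, 0),
    "Bisyllabic 2":      (6, 0, 6, 0),
    "Multisyllabic":     (6, 6, 6, 6),
    "Increasing length": (15, 15, 0, 0),
}

totals = [sum(row[i] for row in items.values()) for i in range(4)]
print(totals)       # [66, 56, 21, 28], matching the table's last row
print(sum(totals))  # 171 judgments, matching the text

# Maximum (poorest) possible total score: articulatory items score 0-4,
# vowel items 0-2, prosody and consistency items 0-1 (see Table 2).
max_score = 4 * totals[0] + 2 * totals[1] + 1 * totals[2] + 1 * totals[3]
print(max_score)    # 425
```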



Table 2. DEMSS scoring.

Assigning specific scores

Overall articulatory accuracy: 5-point multidimensional scoring
  0 = correct on first attempt
  1 = consistent developmental substitution error (e.g., /t/ for /k/; /w/ for /r/) without slowness or distortion of movement gestures
  2 = correct after first cued attempt
  3 = correct after two or three additional cued attempts
  4 = not correct after all cued attempts

Vowel accuracy: 3-point multidimensional scoring
  0 = correct
  1 = mild distortion
  2 = frank distortion

Prosodic accuracy: Binary scoring
  0 = correct
  1 = incorrect

Consistency: Binary scoring
  0 = consistent across all trials
  1 = inconsistent across any 2 or more trials

item have been made. If the child's initial repetition is correct, the utterance is also scored for overall articulatory accuracy on the first trial. Then, the child is asked to imitate the utterance again so that consistency can be judged. If the initial response is incorrect, overall articulatory accuracy and consistency are scored on the basis of the child's subsequent efforts in response to a cuing hierarchy. Specifically, after an initial incorrect attempt, the examiner provides another auditory model while using a gesture to highlight the clinician's articulatory configuration (e.g., pointing with thumb and forefinger to rounded or closed lips) and making sure the child is watching the clinician's face. If the response is still incorrect, the clinician provides additional cuing (e.g., tactile cuing and/or has the child repeat the utterance simultaneously with the clinician, using slower movement gestures). After three or four unsuccessful attempts with cuing, the utterance is scored as incorrect, thereby receiving a score of 4 for overall articulatory accuracy.

Despite the careful selection of stimuli, cuing methods, and scoring procedures undertaken during the development of the DEMSS, its value as an MSE can be firmly asserted only following empirical demonstrations of its reliability and validity. In particular, further development of the DEMSS required that it be studied to determine the extent to which its scores were consistent across different testing occasions and test givers and accurate when used for diagnostic purposes.

Purpose

The purpose of this study was to examine the reliability and validity of this dynamic test of motor speech skill in children. Specifically, we studied the DEMSS's intraexaminer, interexaminer, and test–retest reliability. We also examined its ability to distinguish subgroups within a larger group of 81 children with SSDs. Our goal in this work was to demonstrate the construct validity of the DEMSS for use in identifying children who have speech motor difficulty, especially difficulty with praxis, and therefore need to have motoric approaches included in treatment strategies.

Method

Participants

This study was approved by the Mayo Clinic Institutional Review Board, and legal guardians of participants agreed to the children's participation. Participants included 81 children (63 males, 18 females) between the ages of 36 and 79 months who were consecutively referred for evaluations at the Mayo Clinic for concerns regarding SSDs. Exclusionary criteria included structural deficits (e.g., cleft palate), hearing loss, English as a second language, autism spectrum disorder, and dysarthria. These were determined from the medical record, history, and clinical examination. In the case of dysarthria, the structural–functional examination and clinical observations were also used. Children with dysarthria were excluded from participation because the DEMSS was not designed to prove useful in its differential diagnosis (Yorkston et al., 2010). A specific cutoff for cognitive skill was not determined for inclusion in the study. Participants needed to be able to attend to the clinician for the duration of the DEMSS, attempt the direct imitation, and tolerate cuing. Table 3 reports descriptive data for all participants.

Procedure

All participants completed a comprehensive test battery typically given at the Mayo Clinic for children who are seen for concerns regarding SSDs. Assessment was completed by one of four certified speech-language pathologists who provide pediatric assessments at the Mayo Clinic. The assessment battery included standardized receptive and expressive language testing using the receptive and expressive subtests of the oral language scales of the Oral and Written Language Scales (Carrow-Woolfolk, 1997) or the Preschool Language Scales, Fourth Edition (PLS–4; Zimmerman, Steiner, & Pond, 2002), as well as the Peabody Picture Vocabulary Test—III (PPVT–III; Dunn & Dunn, 1997). A 15-min language sample was elicited through play and by engaging the child in topics of interest. The sample was used clinically to obtain mean length of utterance, to obtain phonetic and phonemic inventories, and to make observations regarding morphology and syntax. The Evaluation of Oral Function and Praxis (EOFP), an unpublished oral structural–functional examination similar to that described by Strand and McCauley (1999), was used to examine neuromuscular status, cranial nerve function (strength, speed of movement, range of motion, etc., for each of the oral structures), and oral nonverbal praxis. The Goldman Fristoe Test of Articulation—Second Edition (GFTA–2; Goldman & Fristoe, 2000) was used to examine accuracy of consonant production in an isolated word context and to identify relevant consistent developmental substitution errors. The DEMSS was used for the motor speech examination by three

Table 3. Participant characteristics ordered by DEMSS score within diagnostic group.

Group        ID   DEMSS   GFTA   Sex   Age (mos)   PPVT   RLS   ELS
Non-CAS       6       0     97    F        72        93    100    75
              8       0    105    F        62       114    105   115
             65       4     69    M        59       126    119   115
             59       4     72    M        62       119     94   108
             50       4     97    M        50       101     81    95
             45       6     61    M        72        77     93    93
             17       7     92    F        54       120    102    89
              5       7     97    M        61       108     85    87
             75      10     83    M        62       119    122   101
             43      11     80    M        52       123    110    95
             42      11     86    M        53       115    110   103
             49      12    106    M        49        97     87    93
             35      13     97    M        61        87     82    84
              2      17     80    M        66       110    108   100
             61      18     84    M        38        89     81    80
             14      18     91    M        51       100     95    95
             47      19     77    M        60        77     85    80
             80      20     54    M        63       109     97    80
             81      20     77    M        46       101    103    93
             24      20     91    M        64       111     97   102
             54      21    106    M        39       129    112   122
             23      22     71    F        53        99    104   103
             36      22     88    F        40       110    116   114
             21      22     92    M        64        94     78    73
             26      23     90    F        58        —      85    84
             72      24     40    M        75        92     97    73
             46      25     46    F        61        —      85    99
             38      25     85    M        45        98     98    92
             78      27     60    M        61       112    115    88
             73      28     77    F        45       119    107    92
             83      29     73    M        47       103    107    96
             19      30     80    M        44        93     90    69
             12      32     78    M        48        95     92    86
             18      34     95    F        44       106     86    92
             44      38     45    M        72        93     88    87
             15      41    107    M        39       126     96   100
             27      42     49    M        67        78     58    70
             29      42     78    M        40       102    104   100
             74      45     78    M        47        77     88    74
             63      46     70    M        56        63     73    67
             62      50     78    M        45       118    107   110
             10      52     80    M        64        63     67    63
             48      54     82    M        56        92     91    78
             60      55     78    M        57       102     83    78
             52      57     73    M        56        95     95    80
             11      62     69    M        73        70     69    59
             76      62     76    M        45        91     90    83
              3      63     83    F        39       109    114   112
             34      82     81    M        36        92     92   102
              1      82     93    M        39        97     94    81
             58      83     62    F        54       112    105    93
             32      85     82    F        53        77     —     —
              4      95     60    M        71       103     78    86
             71     101     56    F        47       103    104    86
             20     103     81    M        36        94     85    86
             31     106     71    M        39       108     96   100
              7     109     84    M        46        77     —     —
             13     113     60    M        45       110    113   119
             51     120     49    M        61        88     81    70
             37     134     69    M        43        85     86    82
             79     205     40    M        56        83     76    82
Mild CAS     66      54     56    M        56       119     99   123
             57      74     55    F        63        99     94    85
             16      99     47    M        71        86     69    65
              9     112     50    M        68        90     84    70
             30     123     62    F        45       123     90    92
             28     145     40    F        69        84     74    75
             67     159     65    M        44        83     73    72
             56     159     81    M        37        90     —     —
Severe CAS   41      46     46    M        69       102    103    99
             25      73     58    M        48        85     76    86
             70     161    <40    M        71        80     —     64
             82     212     71    M        53       106     89    74
             39     232     78    M        38        90     90    80
             40     237     40    M        74        88     92    —
             64     240     40    M        70        75     59    64
             22     260     52    F        45       101     87    79
             55     270     40    F        79        68     —     —
             77     328     60    M        84        64     50    50
             53     337     59    M        37        98     86    90
             33     425     40    M        43        95     84    57

Note. Em dashes indicate that child attention or lack of time prevented completion of the entire test and of scoring. ID = identification; GFTA = standard score on the Goldman Fristoe Test of Articulation—Second Edition; PPVT = Peabody Picture Vocabulary Test; RLS = Receptive Language standard score on the Oral and Written Language Scales or the Preschool Language Scales; ELS = Expressive Language standard score on the Oral and Written Language Scales or the Preschool Language Scales; CAS = childhood apraxia of speech; M = male; F = female.

of the four assessing clinicians. Two clinicians (fourth and fifth authors) underwent training on the test. The test's developer (first author) spent one session with them, explaining and demonstrating the administration and scoring. The clinicians were also given a sheet of scoring instructions. The first author then co-administered or observed video recordings of at least five of each of their administrations prior to beginning the study. Feedback was provided on both methods of elicitation and scoring. For those children assessed by the fourth assessing clinician, the DEMSS was administered by the first author, the fourth author, or the fifth author immediately following their evaluation.

Given that the study was conducted in the context of standard clinical practice at the Mayo Clinic, three of the four clinicians who conducted the clinical exam and speech diagnosis also administered the DEMSS as part of the assessment protocol. In order to reduce possible bias, the subscores (overall articulatory accuracy, vowel accuracy, prosodic accuracy, and consistency) and total scores on the DEMSS were not calculated prior to dictating the report, which was done soon after the assessment session. These reports included a statement of differential diagnosis based on the examiner's clinical observations and examination of language and articulation test results. Although observations



included those made during administration of the DEMSS as percentages of agreement as well as intraclass correlation
well as all other assessment tasks, the DEMSS scores were coefficients (ICCs) of the DEMSS total score and subscores.
not explicitly considered in the diagnosis. As a measure of the correlation among repeated measure-
In our clinical practice, diagnoses related to apraxia of ments on the same participant, the ICC can be thought of as
speech typically indicate whether the difficulty with praxis the proportion of total variability in a measurement that can
for speech movement is the primary deficit, which is reported be attributed to participant-to-participant variability
as CAS. Alternatively, if difficulty with speech praxis (Streiner & Norman, 2003). Because high ICC values suggest
contributes in a lesser or mild way to the child’s SSD, it is that other sources of variability are relatively small, they
reported as either ‘‘phonologic impairment with mild deficits provide evidence of reliability. For example, if 90% of the
in motor planning/programming (mild CAS)’’ or ‘‘articula- variability in a measurement was attributed to participant-to-
tion errors with evidence for mild CAS.’’ (This is because participant variability and 10% was attributed to within-
CAS occurs on a continuum of severity. Some children have participant variability, the ICC would be 0.90. We estimated
such severe deficits in praxis that they have trouble with even the ICC using random-intercept mixed-effects linear regres-
CVC syllables. Others may exhibit various degrees of sion models (Pinheiro & Bates, 2002), in which one treats
characteristics of apraxia, even though the primary problem participant as a random effect, with the fixed-effects portion
may be phonology or residual articulation errors.) For the of the model including only an intercept.
purposes of this article, this clinical distinction is dichot-
omized as CAS or mild CAS (mCAS), respectively.
Validity Procedures
We used two complementary approaches to assess
Data Preparation and Handling validity, one that does not use CAS status a priori and the
Data for each child were recorded on assessment other that does. In the first approach, which does not use
protocol sheets by the clinician and then were entered into a CAS status a priori, we used cluster analysis (Hastie,
database by professional data entry staff. Accurate data Tibshirani, & Friedman, 2003) to determine the degree to
entry was ensured through double data-entry techniques in which the DEMSS identifies meaningful subgroups of
which the data were entered by each of two independent data children with speech disorders. In the second approach,
entry personnel, results were compared, and any discrepan- which does take a priori CAS status into account, we
cies were resolved. Children were identified only by a examined the ability of the DEMSS scores to discriminate
sequential study identifier. between children who were clinically classified as having
CAS or mCAS versus those who were not.
Reliability Procedures and Analyses
We analyzed three components of DEMSS reliability: Cluster Analysis Methods
(a) test–retest reliability, (b) intrajudge reliability, and (c) We used a hierarchical agglomerative cluster analysis
interjudge reliability of the instrument using subgroups of the to identify groups of children with similar profiles of
81 participants. This type of sampling was used in other performance on the DEMSS subscores independent of their
studies of reliability (McLeod, Harrison, & McCormack, clinical diagnosis. We chose this method of cluster analysis
2012; Wetherby, Allen, Cleary, Kublin, & Goldstein, 2002). because it provides a graphic representation of a series of
Included in the test–retest analysis were results from 11 clusterings (e.g., two clusters, then three clusters, etc.), which
children who were able to return for readministration of the can more informatively summarize the heterogeneity of the
DEMSS within 1 week of their initial test. (The first and fourth authors readministered to seven and four children, respectively.) Twelve randomly chosen children were used for the intrajudge reliability (15%). The first and fourth authors each rescored the DEMSS for six of those 12 randomly selected participants based on review of the videotape of their original administration of the instrument.

Twenty children were randomly chosen for the interjudge reliability analyses (25%). We chose to use a larger percentage because we wanted to make a strong case for reliability across clinicians. In this case, the third author scored 10 of those DEMSS administered by the first author and 10 of those DEMSS administered by the fourth author, using the videotapes of the original assessment. Because reliability is more difficult to obtain for children who have more severe impairment, we examined the randomly selected sample for levels of severity and found that 25% of the children randomly selected for the examination of reliability had severe impairment. To evaluate the three components of reliability, we examined

subject group than alternative methods such as k-means cluster analysis, which does not provide this sequence of clusters (Hastie et al., 2003). The four DEMSS subscores (overall articulatory accuracy, vowel accuracy, prosody, and consistency) were the four variables considered. Because of positive skewness, each variable was transformed using the square root transformation. So that each variable was on a common scale, we converted transformed subscores to z scores by subtracting the variable mean from each raw score then dividing the result by the variable SD.

The cluster analysis algorithm starts with each participant forming his or her own cluster. Next, the two clusters that are least dissimilar are merged. The algorithm repeats this procedure until there are only two remaining clusters and stops after merging these two clusters into one single cluster consisting of all participants. The dissimilarity measure we used is the Euclidean distance, defined as the square root of the sum of squared differences. For example, if the four normalized subscores for Participant A were {0.3,

Journal of Speech, Language, and Hearing Research • Vol. 56 • 505–520 • April 2013
0.5, 0.6, 1.1} and for Participant B were {0.7, 0.2, 0.5, 0.0}, the dissimilarity between the two participants would be calculated as the square root of (0.3 − 0.7)² + (0.5 − 0.2)² + (0.6 − 0.5)² + (1.1 − 0.0)². The dissimilarity between two clusters is defined as the mean of all pairwise dissimilarities between participants. The cluster analysis is summarized using a dendrogram, which graphically represents a series of nested clusters in such a way that the members of each cluster are individually identified and the distance or dissimilarity between clusters is indicated by the vertical axis.

To supplement the cluster approach, we used the gap statistic to assess the number of clusters supported by the data (Tibshirani, Walther, & Hastie, 2001). The gap statistic is based on summing across clusters the within-cluster sum of squared errors (SSE). Additional clusters will always reduce the total SSE, but the gap statistic can be used to determine the point at which increasing the number of clusters no longer reduces the total SSE beyond what might be expected by chance. Thus, the gap statistic is not a direct hypothesis test of the number of clusters in a data set but instead helps examine the efficiency (parsimoniousness) of a particular solution.

To evaluate the clustering in the context of other measurements, we compared scores across the clusters for the following variables: gender, age, GFTA, PPVT, receptive language standard score (RLS), and expressive language standard score (ELS). Gender was evaluated with a χ² test, whereas the others were evaluated with a nonparametric Kruskal–Wallis test due to skewness in some measures.

Discrimination Methods

To minimize the potential confound of diagnosis and DEMSS administration having been completed by the same individuals, we structured the study so that the clinician's diagnostic activities and calculation of DEMSS scores were separate tasks. Weeks after all of the children had completed the protocol and after cluster data analyses were completed, the experimenters went back to the children's medical records. At that time, we categorized children on the basis of whether their medical record indicated a diagnosis of (a) CAS (characteristics of apraxia of speech as the major contributor to their SSD), (b) mCAS (phonologic impairment and/or residual articulation errors, with a contribution of at least mild difficulty with speech praxis), or (c) any other SSD. We then compared the diagnostic categorization with the results of the DEMSS cluster analysis.

We used several methods to evaluate the ability of the DEMSS total score and the four subscores to discriminate between participants with CAS (i.e., those with CAS or mCAS diagnoses) and those without CAS. Unlike the cluster analysis described above, these methods explicitly used the participants' clinically defined CAS status, which was taken as the reference standard in this study. When a test is used for diagnosis, sensitivity refers to the extent to which the test correctly identifies those with the disorder as having the disorder, whereas specificity refers to the extent to which it correctly identifies those without the disorder as not having the disorder (Dollaghan, 2007). The cut point of a test or subtest is the score that separates scores considered indicative of disorder from those that are considered indicative of no disorder. The cut points used in this study were empirically derived from univariate binary logistic regression models. The cut points were defined as the smallest value of the variable (total score or subscore) such that the estimated probability of having CAS (viz., having a clinical diagnosis of CAS or mCAS) is greater than 50%.

Because sensitivity and specificity are both affected by the sample base rate (i.e., the likelihood that affected individuals will be included in the sample), we also calculated two other measures of accuracy that are less likely to be affected by sample base rates (Dollaghan, 2007). This is important in this study because there is almost certainly a referral bias at the Mayo Clinic, a tertiary medical center to which many families come for a second opinion about possible CAS. The additional accuracy measures were positive and negative likelihood ratios: LR+ and LR–, respectively. LR+ indicates how much more likely a positive test result is for someone with the disorder than for someone without the disorder. LR–, on the other hand, indicates how much more likely a negative test result is for someone with the disorder than for someone without the disorder. For diagnostic tests, Dollaghan (2007) recommended LR+ exceeding 10 and LR– being less than 0.10.

As a final method for examining the discrimination ability of the DEMSS, we used a nonparametric estimate of the area under the receiver operating characteristic (AUROC) curve and 95% bootstrap confidence intervals to quantify discrimination. The AUROC can be thought of as the average sensitivity of a measurement as well as an estimate of the probability of correctly classifying two children, one with CAS and one without CAS, based only on the DEMSS. We compared the discriminative ability of the DEMSS total score and subscores by using the method described by DeLong, DeLong, and Clarke-Pearson (1988).

Results

Reliability

Results for the reliability analyses are reported in order for test–retest reliability, intrajudge reliability, and interjudge reliability. For each type of reliability, we report the mean percentage of agreement for the 171 judgments as well as the ICC. Figure 1 illustrates the observed agreement for the three reliability conditions.

Test–retest reliability. For test–retest reliability, the mean percentage of agreements was 89%, with a range from 77% to 99%. The data for these calculations were obtained from the 11 participants for whom test–retest reliability data were obtained and consisted of mean percentage of agreement across the 171 judgments per participant at Time 1 and Time 2. The test–retest column of Table 4 shows the ICCs for the total DEMSS score and for the subscores. These ranged from 0.82 for the consistency subscore to > 0.99 for the total DEMSS score.

Intrajudge reliability. The mean intrajudge agreement of the 171 judgments was 89%, with a range from 81% to

Strand et al.: A Motor Speech Assessment for Children 511


Figure 1. A: Relationship between the first and second assessments for each subject. B: The first and second assessments by the same rater. C: The two raters' assessments. Because of the large range of values, the data are shown on a log-transformed scale.

95%. The data for these calculations were obtained from the 12 participants for whom intrajudge reliability data were obtained and consisted of mean percentages of agreement across the 171 judgments per participant made by each of two judges across two scorings. The corresponding ICCs are
shown in the intrajudge column of Table 4. Low intrajudge
ICC values for two subscores (0.31 for prosody and 0.38 for
consistency) along with associated confidence intervals that
included negative values suggest poor or even absent reliability
for these two subscores. In contrast, the intrajudge ICCs for
the overall articulatory accuracy subscore and the total
DEMSS score were high: 0.95 for overall articulatory
accuracy and 0.92 for the total DEMSS score.
Interjudge reliability. The mean interjudge agreement
of the 171 judgments was 91%, with a range from 80% to
98%. The data for these calculations were obtained from the
20 participants for whom interjudge reliability data were
obtained and consisted of mean percentage of agreement for
the two judges across the 171 judgments per participant. The
interjudge column of Table 4 shows that the ICCs ranged
from 0.73 for the consistency subscore to 0.98 for both the
total DEMSS score and the overall articulatory accuracy
subscore.
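For readers who want to reproduce these agreement statistics from raw scoring data, a brief sketch follows. The data here are hypothetical, and because the article does not state which ICC formulation was used, the two-way random-effects, single-measure form (often written ICC(2,1)) is shown as one common choice:

```python
def percent_agreement(judgments_a, judgments_b):
    """Proportion of item-level judgments on which two scorings agree."""
    matches = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return matches / len(judgments_a)


def icc_2_1(scores):
    """Two-way random-effects, single-measure ICC, i.e., ICC(2,1).

    `scores` holds one row per child; each row lists the scores from
    each rater (or each scoring occasion).
    """
    n = len(scores)        # children
    k = len(scores[0])     # raters or occasions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)              # between-children variance
    ms_cols = ss_cols / (k - 1)              # between-rater variance
    ms_err = ss_err / ((n - 1) * (k - 1))    # residual
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
```

Two identical scorings give an ICC of 1.0; as rater disagreement grows relative to true differences among children, the estimate falls toward zero or below, which is consistent with the negative confidence bounds reported here for the prosody and consistency subscores.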

Validity
The cluster analysis using four DEMSS subscores
(overall articulatory accuracy of the word, vowel accuracy,
prosody, and consistency) is summarized in the dendrogram
in Figure 2. Although the clustering algorithm does not take
diagnosis into account, children who had been diagnosed
with CAS or mCAS in their initial clinical evaluation appear
in the figure as boxes and triangles, respectively, to identify
to what degree children with this diagnosis are clustered
together. The vertical axis of the dendrogram represents the
dissimilarity between clusters. A horizontal line drawn at any
point on the dendrogram will divide the participants into
clusters, that is, groups of participants whose profiles on the
four subscores (rather than the DEMSS total scores) are
similar. Key clusters are labeled to guide the reader.
This dendrogram illustrates that there appear to be
three major clusters of children. Cluster A consists of 15
children: three who had been diagnosed with CAS, five who
had been diagnosed with mild CAS, and seven who had been
diagnosed with some other SSD. Cluster B is composed of
seven participants, who had been diagnosed with CAS. As
might be expected, there is considerable variability within the
cluster (as indicated by the horizontal distances among
individuals), even though the algorithm clearly distinguishes
this group from the rest of the children. Cluster C includes all
other participants, including two children who had been
clinically diagnosed with CAS and three children who had
been diagnosed with at least mild difficulty with praxis for
speech movement (mCAS).
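The clustering just summarized follows the procedure laid out in the Method section: square-root transform each subscore, z-score it, measure dissimilarity with Euclidean distance, and repeatedly merge the two least dissimilar clusters, where cluster dissimilarity is the mean of all pairwise member distances (average linkage). A minimal sketch with hypothetical data follows; a production analysis would more likely use a library routine such as SciPy's `linkage(..., method="average")`:

```python
import math


def preprocess(raw):
    """Square-root transform each subscore, then z-score each column."""
    rows = [[math.sqrt(x) for x in row] for row in raw]
    cols = list(zip(*rows))
    zcols = []
    for col in cols:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        zcols.append([(x - mean) / sd for x in col])
    return [list(row) for row in zip(*zcols)]


def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity(c1, c2, points):
    """Cluster dissimilarity: mean of all pairwise member distances."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def agglomerate(points, n_clusters):
    """Merge the two least dissimilar clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        _, a, b = min(
            (dissimilarity(clusters[a], clusters[b], points), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters)))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


# Worked example from the text: Participants A and B on four
# normalized subscores.
d_ab = euclidean([0.3, 0.5, 0.6, 1.1], [0.7, 0.2, 0.5, 0.0])  # ~1.21
```

Average linkage is what the quoted definition of cluster dissimilarity amounts to; other linkage rules (single, complete, Ward) would merge participants in a different order and could yield different clusters.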
Although not designed for use in the framework of
hypothesis testing, the gap statistic provides a heuristic for
determining the number of clusters in a data set. The gap
statistics for the first five clusters were, in order, 0.42, 0.99,
1.02, 0.88, and 0.82. This sequence shows strong evidence

Table 4. Intraclass correlation coefficient estimate (95% confidence interval [CI]) for three components of the reliability analysis.

Variable                                  Test–retest         Intrajudge          Interjudge
Total DEMSS score                         1.00 [.99, 1.00]    .92 [.77, .98]      .98 [.95, .99]
Vowel accuracy subscore                   .99 [.98, 1.00]     .89 [.67, .97]      .94 [.87, .98]
Overall articulatory accuracy subscore    .99 [.98, 1.00]     .95 [.84, .98]      .98 [.96, .99]
Prosodic accuracy subscore                .99 [.96, 1.00]     .31 [–.27, .73]     .91 [.78, .96]
Consistency subscore                      .82 [.48, .95]      .38 [–.20, .77]     .73 [.44, .88]

against the single-cluster solution in that there is a marked increase in the gap statistic for the two-cluster solution, corresponding to an appreciable reduction in the within-cluster sum of squares. The gap statistic shows the three-cluster solution to be optimal. Given that there is only a marginal increase from two to three clusters, the two-cluster solution is the more parsimonious of the two. In this sense, we have found the gap statistic to be generally consistent with the above interpretation of the dendrogram.

Using Clusters A, B, and C identified in the dendrogram, we found no differences across the clusters in gender (p = .81) or age (p = .32). Presenting data in the order Cluster A, B, and C, respectively, we did observe significant differences in scores on the GFTA (Mdn = 65, 60, and 78 for Clusters A, B, and C, respectively; p = .03), PPVT (Mdn = 86, 96, and 102 for Clusters A, B, and C, respectively; p < .01), RLS (Mdn = 86, 88, and 95 for Clusters A, B, and C, respectively; p < .01), and ELS (Mdn = 81, 77, and 89 for Clusters A, B, and C, respectively; p = .03).

Table 5 provides measures of how well the DEMSS total score and subscores discriminate among children with and without CAS, regardless of severity. AUROC values

Figure 2. Dendrogram indicating the sequence of cluster merges beginning with each participant in his or her own cluster at bottom and ending
with all participants merged into a single cluster at top. The vertical axis represents the dissimilarity between merging clusters. A horizontal line
drawn at any point will divide the participants into clusters. We have indicated the point at which there are three clusters and have labeled them
A, B, and C. Although clinical status was not used in the cluster analysis, we have indicated participants with CAS and mCAS, and we have
identified them by their study identifier.



Table 5. Summary measures indicating how well the DEMSS total score and subscores discriminate among children with and without CAS, regardless of severity.

Measure          AUROC (95% CI)       Logistic regression cutoff(a)   Sensitivity/specificity(b)   Specificity at 90% sensitivity   LR+/LR–(b)
Total DEMSS      0.93 [0.83, 0.97]    129                             0.65/0.97                    0.70                             19.8/0.36
Vowel accuracy   0.91 [0.79, 0.96]    20                              0.60/0.98                    0.62                             36.6/0.41
Total accuracy   0.93 [0.82, 0.97]    99                              0.65/0.98                    0.66                             39.6/0.36
Prosody          0.78 [0.64, 0.87]    6                               0.35/0.95                    0.49                             7.1/0.68
Consistency      0.93 [0.82, 0.97]    11                              0.70/0.93                    0.74                             10.7/0.32

Note. AUROC = area under the receiver operating characteristic curve; LR+ = likelihood ratio for a positive test result; LR– = likelihood ratio for a negative test result.
(a) Cutoff is defined as the minimum value such that the estimated probability of CAS from the logistic regression model is greater than 0.5. (b) Based on the indicated logistic regression cutoff.
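The quantities in Table 5 follow from their standard definitions, sketched below with made-up counts and scores (not the study's data). The AUROC estimate here assumes higher DEMSS scores indicate greater impairment:

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity: affected children flagged; specificity: unaffected cleared."""
    return tp / (tp + fn), tn / (tn + fp)


def likelihood_ratios(sens, spec):
    """LR+ = sens / (1 - spec); LR- = (1 - sens) / spec."""
    return sens / (1 - spec), (1 - sens) / spec


def auroc(scores_cas, scores_other):
    """Nonparametric (Mann-Whitney) AUROC: the probability that a randomly
    chosen child with CAS outscores a randomly chosen child without CAS,
    with ties counted as half. Assumes higher scores mean greater impairment."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_cas for n in scores_other)
    return wins / (len(scores_cas) * len(scores_other))


# Illustrative counts only.
sens, spec = sensitivity_specificity(tp=18, fn=2, fp=5, tn=45)
lr_pos, lr_neg = likelihood_ratios(sens, spec)  # ~9.0 and ~0.11 here
```

Against Dollaghan's (2007) benchmarks (LR+ > 10, LR– < 0.10), these illustrative values would fall just short on both counts, which is the same yardstick applied to Table 5.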

(estimates of the probability of correct classification based only on the DEMSS scores) were above 90% for total score and for the vowel, accuracy, and consistency subscores but were not significantly different among these four measures (p = .32; DeLong et al., 1988).

The prosody subscore AUROC of 0.78 was significantly lower than the other measures (p < .02 for each comparison). The DEMSS total score and subscores are ordinal and, therefore, are not designed to support the identification of strict cutoffs. However, we found that based on the logistic regression models and by using a cutoff of an estimated probability greater than 0.5, sensitivity ranged from 0.70 for consistency down to 0.35 for prosody. Specificity was 93% or greater for all measures. Choosing a cutoff with sensitivity at 90% resulted in a specificity ranging from 0.74 for the consistency subscore to 0.49 for the prosody subscore. In addition to measures of specificity and sensitivity, similar findings were demonstrated for two related measures (LR+ and LR–, using the logistic regression cutoff), which were calculated because of their greater independence from the specific sample of test takers (see Table 5). A score above the cutpoint was at least 20 times more likely for participants with CAS than for participants without CAS for the DEMSS total score, accuracy subscore, and vowel subscore, with LR+ values ranging from 20 to 39—values that easily exceed the recommended LR+ > 10 (Dollaghan, 2007). However, LR– values did not reach the recommended LR– < .10 (Dollaghan, 2007).

Discussion

The purpose of this study was to provide evidence for the reliability and validity of a dynamic speech assessment tool for younger children and/or those with more severe speech impairments. The data provide support for acceptable intrajudge, interjudge, and test–retest reliability of the instrument. Results also provide evidence supporting the validity of the DEMSS for the diagnostic purpose of differentiating children with severe speech impairment due, at least in part, to deficits in motor skill, especially planning and programming of movements for speech.

Reliability

Assessments of reliability are undertaken to examine the vulnerability of a test to error or to unintended sources of variability. The test–retest mean agreement of 89% obtained for the DEMSS in this study suggests a high level of test–retest reliability or a relatively low level of error due to differences in the child's performance over two test administrations. This finding was valuable given the difficulties intrinsic to this type of testing—difficulties related to variations in child factors, such as attention and mood, as well as variations in clinician factors, such as judgment focus and interpretation of the child's response (Kent et al., 1987). The intrajudge mean agreement of 89% suggests that the clinician consistently judged a child's performance on the DEMSS with few differences across judgments, even when the second set of judgments was made from a video recording in which the child's activity sometimes prevented visualization of the face or introduced extraneous noise. Thus, clinicians were able to score this test consistently across live versus taped contexts, a finding suggesting that test users can score it online or from a video record as needed, based on the demands of their clinical context. The interjudge mean agreement of 91% suggests acceptable reliability across clinicians. This finding is particularly important given the use of multidimensional scoring for two DEMSS subscores (overall articulatory accuracy and vowel accuracy) because this type of scoring can pose a particular challenge to reliability (Odekar & Hollowell, 2005). Further, this finding is important because two of the three judges in the study (although experienced clinicians) had not been involved in the DEMSS's development and had received less than 2 hr of instruction and practice prior to assuming their roles in the study. Thus, the demonstration of high interjudge agreement suggests the feasibility of the test's use by experienced clinicians following modest training in its administration.

As a further exploration of the DEMSS reliability for subscores, ICCs were used to examine consistency across subscores. ICC findings suggest that the DEMSS total score as well as the four subscores were highly reliable for both the test–retest and interjudge contexts. That is, differences in

scores across children were primarily due to differences among children rather than other factors. In the intrajudge context only, prosody and consistency subscores were less reliable. The poorer performance of these two subscores, however, did not undermine the value of the DEMSS total score in discriminating groups of children. The smaller number of utterances judged for prosody (n = 21) and consistency (n = 28) than for either vowels (n = 56) or overall articulatory accuracy subscores (n = 66) likely contributed to the poorer ICCs. Also, the clinicians anecdotally reported that these judgments were among the most difficult they were required to make, consistent with the findings of Munson, Byorum, and Windsor (2003), thus suggesting the need for providing ample practice on judgments of stress. We plan to provide numerous examples of judgments of stress when we create a training tape for test administration.

Validity

The DEMSS was designed to help clinicians identify the subset of children with severe SSDs who exhibit difficulty in motor skill, particularly motor programming difficulty. The label CAS is most frequently used for these children, who seem to have problems executing accurate movement gestures to reach specific spatial and temporal targets for speech (in the absence of weakness or other neuromuscular deficits). The ability of the DEMSS to distinguish children with characteristics associated with CAS versus those without, within a population of children with SSDs, represents a vital contribution to evidence of its validity.

In this study, agglomerative hierarchical cluster analysis was selected as the preferred method to examine the validity of the DEMSS in part because it could provide evidence that the test differentiated subgroups of children based on their patterns of performance in motor speech skill. The content of the DEMSS was designed not only to include information about children's articulatory accuracy but also to focus on vowel accuracy, prosody, and consistency of error productions on repeated attempts, which have been posited as important behavioral markers for the label of CAS. The three clusters identified using the DEMSS closely resembled groups that were based solely on a clinical diagnosis (CAS, mCAS, and other SSD), which had been made based on an earlier and more comprehensive set of test observations. Although the DEMSS was part of that test protocol, the scores were not tallied until weeks later, and they were not used in establishing the clinical diagnosis. Although this represents a limitation of this study, it is consistent with the incomplete nature of any examination of validity.

The variability among the seven children in one cluster (Cluster B, all of whom had been diagnosed with CAS) illustrates the expected heterogeneity of behaviors observed within the diagnosis. For the children participating in this study, the total DEMSS score as well as the overall articulatory accuracy, vowel, and consistency subscores were all effective in distinguishing among groups of children within a larger group with SSDs. Although the findings suggest that any of those discriminating subscores might be used as a substitute for the whole test in differential diagnosis, the data also suggest that the total DEMSS is the best discriminator.

When logistic regression modeling was used to determine optimal cut points on the DEMSS, measures of sensitivity and specificity as well as likelihood ratios showed that the test did not overidentify CAS. However, a few children who had been given a diagnosis of CAS (n = 2) or mCAS (n = 3) based on the comprehensive speech evaluation were not successfully identified by the DEMSS. This is probably because the DEMSS had been designed specifically to address the significant challenges posed by children who are younger and/or exhibit more severe speech impairments. Thus, some children who were less severely affected probably performed well on the simpler stimuli incorporated in the DEMSS but were given the diagnosis of CAS or mCAS because of their performance on more phonetically complex or multisyllabic stimuli used during other testing conducted as part of the entire evaluation. For example, participants 41 and 25 were not identified by the DEMSS as being CAS, although the clinical examination showed significant difficulty with praxis. They exhibited numerous vowel distortions in connected speech. Direct imitation using visual attention to the clinician's face facilitated their performance on this isolated word task. Groping and voicing errors (frequently noted in children with CAS) were also noted in more difficult tasks but are not used in the scoring of the DEMSS. This likely contributed to the DEMSS not identifying them. This issue points to the likelihood that no one task or test is sufficient for differential diagnosis. This finding was not unexpected and emphasizes the fact that the DEMSS is best used as part of a test battery for children who are younger and/or demonstrate more severely impaired speech acquisition. Until an alternate version of the DEMSS is designed for older children with more speech output, existing measures may be more appropriate for children with those skills (e.g., Rvachew et al., 2005; Thoonen, Maassen, Gabreels, & Schreuder, 1999; Thoonen, Maassen, Wit, Gabreels, & Schreuder, 1996).

This study showed that the DEMSS could identify groups with similar profiles of motor speech characteristics (movement accuracy, vowel accuracy, consistency, and prosody). We have been using the term CAS given the literature suggesting these characteristics as being supportive of the label for this subgroup of children with speech sound disorders. Although theories of CAS accounting for the pathophysiology of the disorder are in early stages of development (e.g., ASHA, 2007; Caruso & Strand, 1999; Nijland, Maassen, & van der Meulen, 2003; Shriberg, Lohmeier, Strand, & Jakielski, 2012; Terband & Maassen, 2010; Terband, Maassen, Guenther, & Brumberg, 2009), there is now fairly good agreement that these characteristics should be present to warrant the use of the diagnostic label. Although a larger number of characteristics may also be present (e.g., sound omissions, substitution errors, reduced phonemic and phonetic inventories), they are less likely to be discriminative than movement accuracy, vowel accuracy, and prosodic accuracy. There are reasons to expect these



particular characteristics in difficulties with praxis (planning–programming volitional movement for speech; Odell & Shriberg, 2001; Shriberg et al., 2003). Although it is still unclear in what way or to what degree the proprioceptive feedback mechanisms, basal ganglia, and cerebellar circuits and/or motor planning areas of cortex are implicated in speech praxis deficits in children, deficits in these neural networks appear consistent with inaccurate movement trajectories and mistiming. If one has trouble with programming specific movement parameters, it would be very difficult to achieve the exact vocal tract shape for vowels and consistent movement patterns across repetitions. Further, because lexical stress, at least in English, depends on the ability to program subtle differences in syllable durations, fundamental frequency, and signal amplitude (Kehoe, Stoel-Gammon, & Buder, 1995), difficulties with motor programming would probably result in lexical stress errors (Shriberg et al., 2003).

The analysis of clustering in the context of other measurements seems to be consistent with expectations that those children in Group C (which was largely comprised of children without diagnoses of either CAS or mild CAS) would perform best on both speech (GFTA) and language measures (PPVT, RLS, ELS). Clusters A and B did not exhibit large differences except on the PPVT, where Cluster B, comprised entirely of children with a CAS clinical diagnosis, unexpectedly performed better than those in Cluster A, which included children with mCAS or other SSD as their diagnosis.

Clinical Significance

Development of the DEMSS was motivated by the need for a psychometrically sound motor speech examination to add to our battery of measurement tools for children with SSDs to identify subgroups of children with speech praxis difficulties. The results of the study also describe a clinically significant observation that is frequently overlooked. That is, children with SSDs do not typically fall neatly into a binary pattern of either having a particular impairment (e.g., phonologic vs. CAS) or not. Rather, children more often present with overlapping characteristics, such as seen in the overlap between diagnostic categories in Figure 2. Children with motor speech impairment probably exhibit errors in phonology given that the motoric deficit almost certainly makes phonologic acquisition more difficult. Further, various degrees of severity of impairment also result in overlapping categories. The reason for a clinical tool such as the DEMSS is not to separate categories completely, but to provide a tool to help identify those children who have at least some evidence for difficulty with praxis so that clinicians planning their treatment can incorporate principles of motor learning.

For a tool to be useful in the clinic, it must be able to be given and scored in a timely manner. Administration time for the DEMSS ranged from 7 min (for children with predictable articulation substitution errors and little need for cuing) to 25 min for more severe children who needed more frequent cuing. A test of this length should be practical even in a busy clinic environment. To use it, clinicians will need to learn the scoring system and practice making judgments, especially those related to lexical stress. The clinicians administering the DEMSS for this study had one 30-min session with the developer, followed by feedback on five administrations. Using a 30-min training tape (which should include practice and feedback) and practicing administering five tests with video review will likely allow clinicians to administer the DEMSS reliably, although verification of this claim is still needed.

Although the purpose of this article was to report the validity and reliability of the DEMSS, comments on other potential uses of this instrument in clinical decision making warrant at least brief discussion. For example, the amount of cuing needed by the child during administration of the DEMSS offers additional information related to the severity of the child's problem as well as the types of cuing that may be most facilitative in treatment. Observation of specific vowel errors and performance on specific syllable structures can be helpful in choosing initial sets of stimuli. Performance under dynamic testing conditions such as those used in the DEMSS could be expected to lead to predictions regarding prognosis. This, in turn, would influence suggestions regarding the use of signing and/or other methods of augmentative communication to facilitate language development and functional communication while work to improve speech acquisition proceeds.

The initial motivation for the DEMSS was that there was no motor speech tool appropriate for assessing very young children with limited output. Many young children with limited output cannot complete the demands of most standardized tests. The dynamic nature of the DEMSS, utilizing direct imitation of simple phonetic content and syllable structure with cuing, provides a valuable tool for these younger children. Our study shows that many children, even those 3 years of age with very little speech output, can complete the DEMSS by attempting imitation of each item. Clinicians are then able to make observations of what happens when the child attempts the production, with and without cuing, allowing clinicians to judge the presence of specific characteristics that may be helpful in determining differential diagnosis as well as severity and prognosis.

Conclusions

The results of this study provide initial evidence for the validity and reliability of the DEMSS as part of a comprehensive protocol for differential diagnosis of children with severe SSDs. The DEMSS was designed to meet the need for a motor speech examination to be used with young and/or severely affected children to determine to what degree motor impairment may be impacting their difficulties with speech acquisition. We believe the data provide a foundation supporting the validity of this instrument to discriminate a subgroup of children who likely have difficulty with praxis for speech. The data also provide some evidence for the validity of these four characteristics as discriminatory for the diagnosis of CAS and support future research in determining

the validity of these variables as part of a behavioral phenotype for CAS.

Although the development of theories of CAS and explication of its nature and most consistently discriminative signs continue, it is important to be able to identify children who may benefit most from treatment approaches and strategies that directly facilitate motor learning and speech motor control. The ASHA (2007) position statement on CAS was a start toward more consistent participant description so that researchers and consumers of research could be more confident about the similarity of children studied and treated under the label CAS. The reliability and validity data for the DEMSS provided in this study suggest that this test has the potential to further advance that clinical and research goal.

Acknowledgments

The first author acknowledges the support of Mayo Clinic CTSA through National Center for Research Resources Grant UL1RR024150. We also thank Heather Clark and David Ridge for reading and commenting on earlier versions of this article.

References

American Speech-Language-Hearing Association. (2007). Childhood apraxia of speech: Nomenclature, definition, roles and responsibilities, and a call for action [Position statement]. Rockville, MD: Author.
Arndt, W. B., Shelton, R. L., Johnson, A. F., & Furr, M. L. (1977). Identification and description of homogeneous subgroups within
operating characteristic curves: A nonparametric approach. Biometrics, 44, 837–845.
Dollaghan, C. (2007). The handbook for evidence-based practice in communication disorders. Baltimore, MD: Brookes.
Downing, S. M., & Haladyna, T. M. (2006). Handbook of test development. Mahwah, NJ: Erlbaum.
Duffy, J. (2005). Motor speech disorders: Substrates, differential diagnosis and management (2nd ed.). St. Louis, MO: Elsevier Mosby.
Dunn, L., & Dunn, L. (1997). Peabody Picture Vocabulary Test—III. Circle Pines, MN: AGS.
Glaspey, A., & Stoel-Gammon, C. (2007). A dynamic approach to phonological assessment. Advances in Speech Language Pathology, 9, 286–296.
Goffman, L. (2004). Assessment and classification: An integrative model of language and motor contributions to phonological development. In A. Kamhi & K. Pollock (Eds.), Phonological disorders in children: Clinical decision making in assessment and intervention (pp. 51–64). Baltimore, MD: Brookes.
Goldman, R., & Fristoe, M. (2000). Goldman Fristoe Test of Articulation—Second Edition. Circle Pines, MN: AGS.
Guyette, T. W. (2001). Review of the Apraxia profile. In B. S. Plake & J. C. Impara (Eds.), The fourteenth mental measurements yearbook (pp. 57–58). Lincoln, NE: Buros Institute of Mental Measurements.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2003). The elements of statistical learning. New York, NY: Springer-Verlag.
Hayden, D., & Square, P. (1999). Verbal Motor Production Assessment for Children. San Antonio, TX: Pearson, PsychCorp.
Johnson, A. F., Shelton, R. L., & Arndt, W. B. (1982). A technique for identifying the subgroup membership of certain misarticulating children. Journal of Speech and Hearing
a sample of misarticulating children. Journal of Speech and Research, 25, 162–166.
Hearing Research, 20, 263–292. Kehoe, M., Stoel-Gammon, C., & Buder, E. H. (1995). Acoustic
Bain, B. A. (1994). A framework for dynamic assessment in correlates of stress in young children’s speech. Journal of Speech
phonology: Stimulability revisited. Clinics in Communication and Hearing Research, 38, 338–350.
Disorders, 4, 12–22. Kent, R. D. (2004). Models of speech motor control: Implications
Ball, L. J., Bernthal, J. E., & Beukelman, D. R. (2002). Profiling from recent developments in neurophysiological and
communication characteristics of children with developmental neurobehavioral science. In B. Maassen, R. D. Kent, H. F. M.
apraxia of speech. Journal of Medical Speech-Language Peters, P. H. M. van Lieshout, & W. Hultsijn (Eds.), Speech
Pathology, 10, 221–229. motor control in normal and disordered speech (pp. 3–28).
Campbell, T. (2003). Childhood apraxia of speech: Clinical London, England: Oxford University Press.
symptoms and speech characteristics. In L. D. Shriberg & T. F. Kent, R. D., Kent, J., & Rosenbek, J. (1987). Maximal performance
Campbell (Eds.), Proceedings of the 2002 Childhood Apraxia of tests of speech production. Journal of Speech and Hearing
Speech Symposium (pp. 37–40). Carlsbad, CA: The Hendrix Disorders, 52, 367–387.
Foundation. Lidz, C. S., & Peña, E. D. (1996). Dynamic assessment: The model,
Carrow-Woolfolk, E. (1997). Oral & Written Language Scales. Circle its relevance as a non-biased approach and its application to
Pines, MN: AGS. Latino American preschool children. Language, Speech, and
Caruso, A., & Strand, E. (1999). Motor speech disorders in children: Hearing Services in Schools, 27, 367–372.
Definitions, background and a theoretical framework. In A. McCauley, R. J. (2001). Assessment of language disorders in children.
Caruso & E.A. Strand, (Eds.), Clinical management of motor Mahwah, NJ: Erlbaum.
speech disorders in children (pp. 1–27). New York, NY: Thieme. McCauley, R. J. (2003). Review of Screening Test for
Conti-Ramsden, G., & Botting, N. (1999). Classification of children Developmental Apraxia of Speech (2nd ed.). In B. Plake, J. C.
with specific language impairment: Longitudinal considerations. Impara, & R. A. Spies (Eds.), The fifteenth mental measurements
Journal of Speech, Language, and Hearing Research, 42, yearbook (pp. 786–789). Austin, TX: Pro-Ed.
1195–1204. McCauley, R. J. & Strand, E. A. (2008). A review of standardized
Crary, M. A. (1993). Developmental motor speech disorders. San tests of nonverbal oral and speech motor performance in
Diego, CA: Singular. children. American Journal of Speech-Language Pathology, 17,
Davis, B. L., Jakielski, K. J., & Marquardt, T. P. (1998). 1–11.
Developmental apraxia of speech: Determiners of differential McLeod, S., Harrison, L. J., & McCormack, J. (2012). The
diagnosis. Clinical Linguistics & Phonetics, 12, 25–45. intelligibility in context scale: Validity and reliability of a
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). subjective rating measure. Journal of Speech, Language, and
Comparing the areas under two or more correlated receiver Hearing Research, 55, 648–656.

Strand et al.: A Motor Speech Assessment for Children 517


McNeil, M. R., Robin, D. A., & Schmidt, R. A. (2009). Apraxia of speech: Definition, differential, and treatment. In M. McNeil (Ed.), Clinical management of sensorimotor speech disorders (2nd ed., pp. 249–268). New York, NY: Thieme.

Munson, B., Bjorum, E., & Windsor, J. (2003). Acoustic and perceptual correlates of stress in nonwords produced by children with suspected developmental apraxia of speech and children with phonological disorder. Journal of Speech, Language, and Hearing Research, 46, 189–202.

Nijland, L., Maassen, B., & van der Meulen, S. (2003). Evidence of motor programming deficits in children with DAS. Journal of Speech, Language, and Hearing Research, 46, 437–450.

Odekar, A., & Hallowell, B. (2005). Comparison of alternatives to multidimensional scoring in the assessment of language comprehension in aphasia. American Journal of Speech-Language Pathology, 14, 337–345.

Odell, K. H., & Shriberg, L. D. (2001). Prosody-voice characteristics of children and adults with apraxia of speech. Clinical Linguistics & Phonetics, 15, 275–307.

Ozanne, A. (1995). The search for developmental verbal dyspraxia. In B. Dodd (Ed.), Differential diagnosis and treatment of children with speech disorder (pp. 91–101). San Diego, CA: Singular.

Peña, E. D., Gillam, R. B., Malek, M., Ruiz-Felter, R., Resendiz, M., & Sabel, T. (2006). Dynamic assessment of school-age children's narrative ability: An investigation of reliability and validity. Journal of Speech, Language, and Hearing Research, 49, 1037–1057.

Peter, B., & Stoel-Gammon, C. (2008). Central timing deficits in subtypes of primary speech disorders. Clinical Linguistics & Phonetics, 22, 171–198.

Pinheiro, J. C., & Bates, D. M. (2002). Mixed effects models in S and S-Plus. New York, NY: Springer.

Rvachew, S., Hodge, M., & Ohberg, A. (2005). Obtaining and interpreting maximum performance tasks from children: A tutorial. Journal of Speech-Language Pathology and Audiology, 29, 146–157.

Shriberg, L. (2003). Diagnostic markers for child speech-sound disorders: Introductory comments. Clinical Linguistics & Phonetics, 17, 501–505.

Shriberg, L. D., Aram, D. M., & Kwiatkowski, J. (1997). Developmental apraxia of speech: II. Toward a diagnostic marker. Journal of Speech, Language, and Hearing Research, 40, 286–312.

Shriberg, L. D., Campbell, T. F., Karlsson, H. B., Brown, R. L., McSweeny, J. L., & Nadler, C. J. (2003). A diagnostic marker for childhood apraxia of speech: The lexical stress ratio. Clinical Linguistics & Phonetics, 17, 549–574.

Shriberg, L. D., Lohmeier, H. L., Strand, E. A., & Jakielski, K. J. (2012). Encoding, memory, and transcoding deficits in childhood apraxia of speech. Clinical Linguistics & Phonetics, 26, 445–482.

Skahan, S. M., Watson, M., & Lof, G. L. (2007). Speech-language pathologists' assessment practices for children with suspected speech sound disorders: Results of a national survey. American Journal of Speech-Language Pathology, 16, 246–249.

Smith, A., & Goffman, L. (2004). Interaction of motor and linguistic factors in the development of speech production. In B. Maassen, R. Kent, H. Peters, P. van Lieshout, & W. Hulstijn (Eds.), Speech motor control in normal and disordered speech (pp. 225–252). London, England: Oxford University Press.

Strand, E. A. (1992). The integration of motor-speech processes and language formulation in process models of language acquisition. In R. S. Chapman (Ed.), Child talk: Processes in language acquisition and disorder (pp. 86–107). St. Louis, MO: Mosby.

Strand, E. A. (2003). Childhood apraxia of speech: Suggested diagnostic markers for the younger child. In L. D. Shriberg & T. F. Campbell (Eds.), Proceedings of the 2002 Childhood Apraxia of Speech Symposium (pp. 75–79). Carlsbad, CA: The Hendrix Foundation.

Strand, E. A., & McCauley, R. J. (1999). Assessment procedures for treatment planning in children with phonologic and motor speech disorders. In A. Caruso & E. A. Strand (Eds.), Clinical management of motor speech disorders of children (pp. 73–108). New York, NY: Thieme.

Streiner, D. L., & Norman, G. R. (2003). Health measurement scales: A practical guide to their development and use (3rd ed.). New York, NY: Oxford University Press.

Terband, H., & Maassen, B. (2010). Speech motor development in childhood apraxia of speech: Generating testable hypotheses by neurocomputational modeling. Folia Phoniatrica et Logopaedica, 62, 134–142.

Terband, H., Maassen, B., Guenther, F., & Brumberg, J. (2009). Computational neural modeling of speech motor control in childhood apraxia of speech (CAS). Journal of Speech, Language, and Hearing Research, 52, 1595–1609.

Thoonen, G., Maassen, B., Gabreels, F., & Schreuder, R. (1999). Validity of maximum performance tasks to diagnose motor speech disorders in children. Clinical Linguistics & Phonetics, 13, 1–23.

Thoonen, G., Maassen, B., Wit, J., Gabreels, F., & Schreuder, R. (1996). The integrated use of maximum performance tasks in differential diagnostic evaluations among children with motor speech disorders. Clinical Linguistics & Phonetics, 10, 311–336.

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 411–423.

Verté, S., Geurts, H. M., Roeyers, H., Rosseel, Y., Oosterlaan, J., & Sergeant, J. A. (2006). Can the Children's Communication Checklist differentiate autism spectrum subtypes? Autism, 10, 266–287.

Wetherby, A. M., Allen, L., Cleary, J., Kublin, K., & Goldstein, H. (2002). Validity and reliability of the Communication and Symbolic Behavior Scales Developmental Profile with very young children. Journal of Speech, Language, and Hearing Research, 45, 1202–1218.

Yorkston, K., Beukelman, D., Strand, E. A., & Hackel, M. (2010). Management of motor speech disorders in children and adults. Austin, TX: Pro-Ed.

Zimmerman, I. L., Steiner, V. G., & Pond, R. E. (2002). Preschool Language Scales (4th ed.). San Antonio, TX: Pearson PsychCorp.

Appendix
DEMSS Content Scoring
General Instructions
• Make sure the child's face is in good view on the camera. Check the audio levels. Make sure there is little background noise.
• Draw the child's attention to your face for every item.
• Use a moderate to slightly slow, but natural, rate and natural prosody.
• Use reinforcers that are quick and that keep the child's attention on your face.
• Score online so that you will be sure to do the appropriate cuing for the third and fourth attempts. Then, if necessary, go back and score from the videotape.
• Give up to four trials.
• Use an X only if there was no attempt.
Rules for Dynamic Administration and Scoring

Overall Articulatory Accuracy Scoring


0 = Immediate correct repetition.
1 = Immediate (no groping or trial and error); accurate rate and movement but consistent error.
• Must have a normal vowel.
• Substitution error that is developmental, consistent, and predictable (e.g., f/th, w/r, w/l, lateral lisp). This does not include final consonant deletion (or any sound deletion), which should be cued.
• Score children with phoneme collapse (use of one or two sounds to represent most substitutions) with a 1, because it represents a consistent error. Of course, if the child also exhibits vowel distortion, groping, or other errors, the score will be a 2, 3, or 4.
2 = Incorrect first try. Give a 2 if the child is able to produce the word after one additional model of the item (the clinician draws attention to his or her own face and provides the model a little more slowly). Also give a 2 if the child gets the correct response on the first try but with more time or more groping and/or trial-and-error behavior.
3 = Needs cuing, such as slow, simultaneous production and/or gestural or tactile cues (correct within three or four additional trials). Do not go beyond four cued trials.
4 = No correct response with repeated attempts.
X = Refusal/inattention/no attempt.
• Score accuracy on the first try only if it is correct (0). If correct, go on to the second repetition in order to score consistency. If the first production is correct but the second is incorrect, keep the accuracy score as a 0 but give a 1 for consistency.
• If the child is not accurate on the first attempt, score accuracy after completing the described cuing procedure.
• Let the child use his or her own strategies for self-cuing during the first attempt.
• If the child self-corrects on the first attempt, without intervening cues from the clinician, score the self-correction.
• If you suspect the child heard the wrong target word (e.g., you say ma, but the child thinks you said mom), start the item again.
• Final plosives do not have to be aspirated; either production is acceptable.
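Read as a decision procedure, the rules above map each elicited item to a single score. The sketch below is purely illustrative and not part of the DEMSS; the function name and its inputs (the trial on which a correct production first occurred, whether stronger cuing was used, whether groping or extra time was observed, and whether the error was a consistent developmental substitution) are our own shorthand for the rubric.

```python
def articulatory_accuracy_score(correct_trial, cued=False, groping=False,
                                consistent_substitution=False, attempted=True):
    """Illustrative mapping from a dynamic-cuing outcome to the 0-4
    overall articulatory accuracy score (X = no attempt).

    correct_trial: trial (1-4) of the first correct production, or None.
    cued: True if slow/simultaneous, gestural, or tactile cues were used.
    groping: True if the first correct try needed extra time or showed
        groping or trial-and-error behavior.
    consistent_substitution: True for an immediate, consistent,
        developmental substitution with a normal vowel (e.g., f/th, w/r).
    """
    if not attempted:
        return "X"                  # refusal / inattention / no attempt
    if correct_trial == 1:
        return 2 if groping else 0  # immediate correct, or correct with effort
    if consistent_substitution:
        return 1                    # consistent, predictable substitution error
    if correct_trial == 2 and not cued:
        return 2                    # correct after one additional model
    if correct_trial is not None and correct_trial <= 4:
        return 3                    # correct only with cuing within four trials
    return 4                        # no correct response despite repeated attempts
```

For example, a child who is correct only on a fourth, tactilely cued trial would receive a 3, whereas a consistent w/r substitution produced immediately and fluently would receive a 1.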

Vowel Scoring—Score Only on First Attempt


0 = Immediate correct repetition of the vowel in that coarticulatory context.
1 = Mild distortion (e.g., neutralization; slightly "off" vocal tract shape; if you think to yourself, "Was that OK?" give it a 1).
2 = Frank distortion.
• Always score the vowel on the first trial.
• If a word has two or more vowels (e.g., banana), score the vowel in the stressed syllable.
• If the child self-corrects on the first attempt, without intervening cues from the clinician, score the self-correction.

Prosody Scoring (Lexical Stress)—Score on First Attempt


0 = Correct prosody.
1 = Incorrect prosody.
• Score prosody using binary scoring: a 0 for correct prosody and a 1 for incorrect prosody (lexical stress errors such as segmentation, equal stress, or wrong stress).
• Use appropriate lexical stress as a model, especially for the sections where prosody is scored (e.g., don't segment syllables or use equal stress in an effort to elicit better accuracy). The stressed syllable is in bold.



Consistency Scoring
0 = Consistent.
1 = Inconsistent.
• When scoring consistency, have the child repeat the utterance in succession, without intervening test items. If any two responses during cuing are different, score the item as inconsistent.
• Score consistency using binary scoring: a 0 for "consistent" and a 1 for "inconsistent."
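Taken together, the four dimensions above (articulatory accuracy 0–4, vowel 0–2, prosody 0–1, consistency 0–1) yield item-level scores that can be summed into per-dimension subscores and a total, with higher totals reflecting poorer performance. The sketch below is illustrative only: the field names are ours, and it assumes simple unweighted sums, which may not match the published DEMSS score form exactly.

```python
from dataclasses import dataclass

@dataclass
class ItemScore:
    accuracy: int          # 0-4 overall articulatory accuracy (omit X items)
    vowel: int             # 0-2 vowel accuracy, first attempt only
    prosody: int = 0       # 0-1 lexical stress, on scored items only
    consistency: int = 0   # 0-1 across successive repetitions

def demss_totals(items):
    """Sum item-level scores into per-dimension subscores and an
    overall total (higher = more impaired)."""
    totals = {
        "accuracy": sum(i.accuracy for i in items),
        "vowel": sum(i.vowel for i in items),
        "prosody": sum(i.prosody for i in items),
        "consistency": sum(i.consistency for i in items),
    }
    totals["total"] = sum(totals.values())
    return totals
```

For example, three items scored (0, 0), (3, 1, prosody 1, consistency 1), and (2, 2, consistency 1) would yield subscores of 5, 3, 1, and 2 and a total of 11.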
