
Available online at www.sciencedirect.com
Computer Speech and Language 37 (2016) 98–128

Review

Systematic review of virtual speech therapists for speech disorders


Yi-Ping Phoebe Chen a,∗ , Caddi Johnson b , Pooia Lalbakhsh a , Terry Caelli c ,
Guang Deng c , David Tay c , Shane Erickson b , Philip Broadbridge d , Amr El Refaie b ,
Wendy Doube e , Meg E. Morris b
a Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC 3086, Australia
b School of Allied Health, La Trobe University, Melbourne, VIC 3086, Australia
c Department of Engineering, La Trobe University, Melbourne, VIC 3086, Australia
d Department of Mathematics and Statistics, La Trobe University, Melbourne, VIC 3086, Australia
e Faculty of Health, Arts, and Design, Swinburne University of Technology, Melbourne, VIC 3122, Australia

Received 8 September 2014; received in revised form 13 August 2015; accepted 21 August 2015
Available online 4 November 2015

Abstract
In this paper, a systematic review of relevant published studies on computer-based speech therapy systems or virtual speech
therapists (VSTs) for people with speech disorders is presented. We structured this work based on the PRISMA framework. The
advancements in speech technology and the increased number of successful real-world projects in this area point to a thriving
market for VSTs in the near future; however, there is no standard roadmap to pinpoint how these systems should be designed,
implemented, customized, and evaluated with respect to the various speech disorders. The focus of this systematic review is on
articulation and phonological impairments. This systematic review addresses three research questions: what types of articulation
and phonological disorders do VSTs address, how effective are virtual speech therapists, and what technological elements have been
utilized in VST projects. The reviewed papers were sourced from comprehensive digital libraries, and were published in English
between 2004 and 2014. All the selected studies involve computer-based intervention in the form of a VST regarding articulation or
phonological impairments, followed by qualitative and/or quantitative assessments. To generate this review, we encountered several
challenges. Studies were heterogeneous in terms of disorders, type and frequency of therapy, sample size, level of functionality,
etc. Thus, overall conclusions were difficult to draw. Commonly, publications with rigorous study designs did not describe the
technical elements used in their VST, and publications that did describe technical elements had poor study designs. Despite this
heterogeneity, the selected studies reported that computers provide a more engaging form of intervention, with more tools to enrich intervention programs, particularly for children; however, they emphasized that virtual therapists should not drive the intervention but should serve as a medium for delivering the intervention planned by speech-language pathologists. Based on the reviewed papers, VSTs are significantly effective in training people with a variety of speech disorders; however, it cannot be claimed that a consensus exists on the superiority of VSTs over speech-language pathologists regarding rehabilitation outcomes. Our review shows that hearing impairment was the most frequently addressed disorder in the reviewed studies. Automatic speech recognition, speech corpus, and speech synthesizers were the most popular technologies used in the VSTs.
© 2015 Elsevier Ltd. All rights reserved.

Keywords: Virtual speech therapist; Computer-based speech therapy; Speech and language disorders; Computer-based intervention

This paper has been recommended for acceptance by K. Kirchhoff.


∗ Corresponding author. Tel.: +61 3 94796768.
E-mail address: phoebe.chen@latrobe.edu.au (Y.-P.P. Chen).

http://dx.doi.org/10.1016/j.csl.2015.08.005
0885-2308/© 2015 Elsevier Ltd. All rights reserved.

Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2. Virtual speech therapist characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3. Previous literature reviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4. Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.1. Eligibility criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.1.1. Type of studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.1.2. Types of participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.1.3. Types of intervention. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .101
4.1.4. Types of outcome measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2. Information sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3. Search terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.4. Study selection and data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5. Data items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.1. Sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.2. Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.3. Type of studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4. Disorders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5. Intervention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.1. Training stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.2. Duration and frequency of therapy sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6. Outcome measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .120
5.7. VST technological building blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.7.1. Automatic speech recognition (ASR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.7.2. Facial feature tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7.3. Speech synthesizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.7.4. Expert systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.7.5. Speech corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.7.6. Other technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.8. Therapy delivery approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.8.1. 3D virtual heads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.8.2. Computer games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.9. Support features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.9.1. Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.9.2. Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.9.3. Personalized interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6. Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.1. Disorders addressed by the VSTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.2. Effectiveness of VSTs in therapy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.3. Technological elements of VSTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

1. Introduction

Humans are social creatures and communication via speech enables humans to interact and share thoughts in a way
which is not possible for any other species. Speech impairment has been shown to have an adverse impact on learning,
literacy, applying knowledge, developing and maintaining relationships with friends and family, and securing and
keeping a job (McCormack et al., 2009). A speech disorder can diminish a person's role in society, discouraging them from participating in social activities in a way that realizes their potential. This may lead to social anxiety disorders and avoidance behavior (Beilby et al., 2012; Hawley et al., 2013; McLeod et al., 2013). Considering the wide range of speech impairments, the prevalence of people with such disorders, and the related undesirable consequences for society,
the importance of appropriate and comprehensive rehabilitation programs is evident (Bhattacharyya, 2014). An inquiry published by the Senate Standing Committees on Community Affairs of Australia in 2014 (Report, 2014), for instance, reported an estimate of more than 1.1 million Australians with a communication disorder, around 5% of the Australian population. Taking advantage of advances in computer software and hardware, as well as miniaturization and mobility, computer-based speech therapy (CBST) environments, referred to as virtual speech therapists (VSTs), are becoming increasingly popular (Kagohara et al., 2013). Compared with traditional speech therapy, these environments are gaining acceptance because of their versatility, availability, portability, and controllability (Abad et al., 2013; van Vuuren and Cherney, 2014). They can address the shortage of speech-language pathologists (SLPs) in schools and regional areas by reducing the number of face-to-face therapy sessions, resulting in more affordable services. They can also deliver impartial judgements as feedback and provide useful automatic profiling materials (Henshaw and Ferguson, 2013; van Vuuren and Cherney, 2014).
To the best of our knowledge, this is the first systematic review on virtual speech therapists (VSTs). It contains
studies that state the intention to test the effects of VST programs on articulation and phonological disorders due
to conditions such as dyslalia, aphasia, dysarthria, childhood speech-sound disorder, residual articulation errors (e.g.
lisping), or hearing impairment. However, we did not consider other disorders that may affect voice and speech, such as Alzheimer's disease (Mesulam et al., 2014), autism (Khowaja and Salim, 2013) and Down's syndrome (Laws and Hall, 2014), in order to exclude individuals with concomitant cognitive impairment, which can affect the success rate of articulation and/or phonological therapies and the ability to engage with VSTs compared with face-to-face therapy. To select the related studies, extract the information and present the results, we followed the PRISMA Statement (Liberati et al., 2009), focusing on three research questions: (1) what types of articulation and phonological disorders have VSTs addressed, (2) how effective were the VSTs in therapy, and (3) what technological elements have been utilized in VST projects.
In the next section, we propose a definition of a VST and several features that a VST should support. This definition serves as our reference for the term VST throughout the paper. Section 3 deals with previous reviews of VSTs. Section 4 describes the method used to prepare this systematic review. We examine the extracted results in detail in Section 5. An overall discussion of the results is presented in Section 6, and finally Section 7 concludes the paper.

2. Virtual speech therapist characteristics

Considering different intervention frameworks, an overall regime for the majority of speech-communication therapy
starts with the assessment of the patient’s strengths and weaknesses. The SLP designs an appropriate therapy program
based on the profile of the person under therapy, the disorder, and short and long-term targets. The outcomes of the
therapy sessions are evaluated using specific materials and tables (Chien et al., 2014). To the best of our knowledge, no
computer-assisted speech therapy system has been developed to deliver the whole therapy cycle. The cognitive process by which SLPs drive and manage the therapy, based on their knowledge and experience, is too complicated to be completely coded into the algorithmic nature of software applications (Wren et al., 2010). Accordingly, none of the studied VSTs claims to be able to replace SLPs across the whole therapy.
We did not find a complete and comprehensive definition of a VST in the speech pathology and computer science literature. On the one hand, there were many speech pathology studies using computer-based intervention in which computers were not involved to the degree that they could be referred to as VSTs; on the other hand, we found many published studies on disordered speech processing in which no formal intervention framework was considered. To specify a domain for this systematic review and define a discriminative concept for selecting publications from the large number of existing studies, we defined a VST as follows:
Definition 1. An interactive computer program that targets a specific speech deficit1 based on a predefined therapy
program, but not a program designed to facilitate improvement in speech or language skills in learning a non-native
language, in the absence of a diagnosed speech or language deficit, or to tutor individuals whose abilities are already
within the normal range.

1 For this systematic review, the speech disorders are limited to articulation and phonological disorders and speech-language problems in hearing-impaired people.

In the above definition, a computer refers to any electronic device which uses a processor to run a stored program
in its memory; therefore, the platform can be a personal computer (PC) in the form of desktops or laptops, personal
digital assistants (PDAs), tablets, or mobile phones. All 20 studies reported PCs as the platform, while two studies
reported VSTs that could be run on PDAs as well (Danubianu et al., 2009; Schipor et al., 2010).

3. Previous literature reviews

We carried out a search to find any type of review on virtual speech therapy and, to the best of our knowledge, this paper is the first systematic review on VSTs. We found several reviews in related fields addressing computer-assisted language learning (Golonka et al., 2012; Henshaw and Ferguson, 2013; Lidström and Hemmingsson, 2014) which may be useful for the reader; however, we do not include them in this paper since our focus is on articulatory and phonological skills in speech pathology.

4. Method

To carry out a standard and transparent systematic review, a process based on the PRISMA Statement was followed
(Liberati et al., 2009). Studies were eligible if computer-based therapy was utilized for the treatment of articulation and phonological disorders. They were also required to present quantitative or qualitative assessments and have the potential to address the three research questions:

1. What types of articulation and phonological disorders have VSTs addressed?


2. How effective were the VSTs in the therapy?
3. What technological elements have been utilized in VST projects?

The eligibility criteria are detailed in the following subsection.

4.1. Eligibility criteria

4.1.1. Type of studies


We considered various types of studies, such as meta-analyses, systematic reviews (Liberati et al., 2009), controlled trials and randomized controlled trials (Schulz et al., 2010), cohort studies (Furberg and Friedman, 2012), and case series and case reports (Berg and Lune, 2011), which examine the efficacy of VSTs in the rehabilitation of articulation and phonological disorders. Definitions of the study types can be found in the references provided. Papers were restricted to articles written in English and published between 2004 and 2014 (research carried out over the past ten years), as computer-based technology is a fast-developing field and earlier studies quickly become obsolete or superseded. The last search was run on July 28th, 2014.

4.1.2. Types of participants


Participants of any age and any sex with articulation and phonological impairments due to conditions such as dyslalia, aphasia, dysarthria, childhood speech-sound disorder, residual articulation errors (e.g. lisping), specific language impairment (SLI), or hearing impairment were taken into account. Studies including disorders such as autism (Khowaja and Salim, 2013), Alzheimer's disease (Mesulam et al., 2014), and Down's syndrome (Laws and Hall, 2014) were excluded, since the concomitant cognitive impairment in these individuals can affect the success rate of VST therapy and may limit their ability to engage with VSTs compared with face-to-face therapy.

4.1.3. Types of intervention


This systematic review was limited to studies which analyze the effects of using VSTs in therapy sessions to improve
articulation and phonological skills underlying speech production and/or speech comprehension. The adopted VST
should match the VST definition presented in Section 2.

4.1.4. Types of outcome measures


This systematic review covers studies on both speech-language deficits and hearing-impaired cases; therefore, in quantitative studies, both speech production and speech comprehension measures are taken into account, such as the Goldman-Fristoe Test of Articulation (GFTA) (Goldman and Fristoe, 2000), percentage of consonants correct (PCC) (Shriberg et al., 1997), correctness of pronunciation, task completion performance, the word discrimination test (WDT), the phonological assessment battery (PhAB) (Frederickson et al., 1997), phonological awareness (Gillon, 2004), the hearing in noise test, sound pressure level, word recognition accuracy (WRA), the BKB sentence test (Bench et al., 1979), average sentence-level word accuracy, word naming score (WNS), and word verification rate (WVR). In qualitative studies, the outcome measures are interviews and/or questionnaires designed to address the research questions.
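As an illustration of one of these production measures, the sketch below computes percentage of consonants correct (PCC) from phoneme-level clinician scoring. The data structure and function name are hypothetical and not taken from any of the reviewed studies.

```python
from typing import List, Tuple

def percentage_consonants_correct(scored_consonants: List[Tuple[str, bool]]) -> float:
    """Percentage of consonants correct (PCC): correctly produced consonants
    divided by all consonants attempted, times 100. Each entry pairs a target
    consonant with a bool indicating whether the clinician judged it correct."""
    if not scored_consonants:
        return 0.0
    correct = sum(1 for _, ok in scored_consonants if ok)
    return 100.0 * correct / len(scored_consonants)

# Hypothetical example: 8 of 10 target consonants produced correctly -> 80.0
sample = [("s", True), ("z", False), ("r", True), ("l", True), ("t", True),
          ("k", True), ("g", False), ("m", True), ("n", True), ("d", True)]
print(percentage_consonants_correct(sample))
```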

4.2. Information sources

The studies were identified by searching electronic databases, scanning reference lists of articles and engaging in
consultation with experts in the field of information technology and speech therapy. No limits were applied to the
languages the proposed VSTs were designed for. This search was applied to Medline, PubMed,2 ProQuest Central,3
Web of Science,4 Allied and Contemporary Medicine (AMED),5 Informa Healthcare,6 Wiley Digital Library,7 Tay-
lor & Francis,8 Springer,9 ScienceDirect,10 IEEEXplore,11 and ACM Digital Library12 electronic databases. The
SpeechBite13 database was also searched. Finally, we tried GoogleScholar14 as an integrated and comprehensive
academic search engine.

4.3. Search terms

The following search terms were used to search all the databases: speech disorder*; language disorder*; dysarthr*; dyslalia*; articulat* disorder*; phonological disorder*; stutter*; apraxia; hearing impair*; hearing disorder*; deaf*; hard of hearing; aphasi*; voice disorder*; childhood apraxia of speech; virtual speech therap*; computer assisted therap*; computer assisted instruction; computer-based intervention; computer aided therapy; multimedia; and e-learning. Boolean operators were used to combine the terms, for example: compu* and speech, compu* and phonology*, compu* and articula*, compu* and virtual therap*, speech disorder* and treatm*, compu* and (speech or voice or language) and (disorder* or deficit or impair*).
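A minimal sketch of how such Boolean query strings can be assembled programmatically for the database searches; the term lists and helper below are illustrative and are not the scripts used by the authors.

```python
# Illustrative helper for composing Boolean search strings like those above.
condition_terms = ["speech disorder*", "phonological disorder*", "articulat* disorder*",
                   "dysarthr*", "aphasi*", "hearing impair*"]
technology_terms = ["compu*", "virtual speech therap*", "computer assisted therap*",
                    "computer-based intervention"]

def boolean_query(group_a, group_b):
    """Combine two term groups as (a1 OR a2 ...) AND (b1 OR b2 ...)."""
    a = " OR ".join(f'"{t}"' for t in group_a)
    b = " OR ".join(f'"{t}"' for t in group_b)
    return f"({a}) AND ({b})"

print(boolean_query(condition_terms, technology_terms))
```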

4.4. Study selection and data collection

A total of 919 studies were retrieved from the individual databases through the searches. A hand search of reference lists retrieved another 62 studies. After duplicates were removed, 782 studies remained. These studies were then filtered in three stages to address our research questions and meet the eligibility criteria. A review of the titles and keywords resulted in 219 studies; titles referring to disorders other than articulation and phonological disorders, and titles referring to second language learning and/or science tutors, were omitted from the list. A review of the abstracts by the two review authors resulted in 46 studies that met the inclusion criteria. In this stage, studies whose titles or keywords related to articulation and phonological impairments but which addressed the effects of other disorders (such as depression, Alzheimer's disease, etc.) on speech were omitted. We also removed studies without an assessment of the adopted VST and studies which used VSTs without reference to therapeutic objectives. The full texts of the 46 articles were then reviewed. These papers were filtered based on the VST definition presented in Section 2, the information presented on the conducted experiments, and the reported technical descriptions of their VSTs. There were disagreements between the two review authors at this stage, which were resolved by discussion with the co-authors. The disagreements were mostly over whether the developed VST matched Definition 1 (see Section 2) and over the amount of technical information the studies presented on their VSTs. Fig. 1 shows all the phases of the identification and selection of the materials finally included in this systematic review. As shown in Fig. 1, 20 publications were finally selected.

Fig. 1. Flow chart showing different phases of material identification and selection.

2 http://www.ncbi.nlm.nih.gov/pubmed.
3 http://www.proquest.com.
4 http://wokinfo.com.
5 https://www.ebscohost.com/academic/AMED-The-Allied-and-Complementary-Medicine-Database.
6 https://www.informahealthcarestore.com/journals.html.
7 http://onlinelibrary.wiley.com.
8 http://www.taylorandfrancis.com/.
9 http://www.springer.com/gp/.
10 http://www.sciencedirect.com.
11 http://ieeexplore.ieee.org/Xplore/home.jsp.
12 http://dl.acm.org/.
13 http://speechbite.com/.
14 https://scholar.google.com.

4.5. Data items

We extracted the required information on: (1) study type, (2) characteristics of the participants (age, sex, the type
of impairment(s) the participants had), (3) specifications of the therapy sessions (number of sessions, length of each
session, how often the sessions were conducted), (4) intervention characteristics, (5) type of outcome measures and
(6) outcomes. We organized this information in a single table (see Table 1) for the selected studies. In relation to the
technical aspects of VSTs, the name of the adopted system, the environmental description and technological details
were extracted as CBST name, CBST details and CBST technology, respectively. This technical information was
included in another table (see Table 2).

5. Results

As mentioned in the previous section, a total of 20 articles on the use of VSTs in speech-language therapy
between 2004 and 2014 were shortlisted. Fig. 2 details the publication trend of the shortlisted papers. In relation to the
digital libraries, four of the selected publications were retrieved from the PubMed digital library (Massaro and Light,
2004; Segers and Verhoeven, 2004; Silva et al., 2012; Thompson et al., 2010), followed by three publications from
ScienceDirect (Abad et al., 2013; Moore et al., 2005; Saz et al., 2009). Two publications from Springer (George and
Table 1
Intervention characteristics considered in shortlisted publications. Each entry lists the study type, intervention, sessions, sample size, participant characteristics, outcome measures and outcomes.

Segers and Verhoeven (2004)
Study type: Control trial (pre-test and post-test).
Intervention: To determine whether kindergarten children with specific language impairment (SLI) could develop phonological awareness skills through computer intervention.
Sessions: 15 min weekly sessions for 5 weeks.
Sample size: 36 (31M, 5F).
Participants: 4–6 year old children with specific language impairment (SLI).
Outcome measures: Pre- and post-test of phonological awareness task scores. Description: five phonological awareness tests (rhyme awareness task, word awareness task, phoneme analysis task, syllabic awareness task and phoneme synthesis task) and the Colored Progressive Matrices.
Outcomes: Results showed positive effects of using computer phonological awareness intervention on kindergarten children with SLI. The authors considered a pre-test and two post-test groups. Both post-test groups benefited from the VST with different settings. Statistical significance tests showed a significant difference between both post-test groups and the pre-test, while the difference between post-test 1 and post-test 2 was not significant.

Massaro and Light (2004)
Study type: Cohort (pre-test, post-test and follow-up).
Intervention: To investigate the efficacy of a computer animated talking head and a virtual language tutor for speech perception and production. Training was conducted on voice vs. voiceless distinctions, consonant cluster distinctions, and fricative vs. affricative distinction.
Sessions: 6 h across 21 weeks.
Sample size: 7 (2M, 5F).
Participants: 8–13 year old children with hearing loss.
Outcome measures: Pre- and post-test of the proportion of correct tries for two test phases of speech perception and speech production. Description: 104 words including all of the training segments in all contexts.
Outcomes: For both phases of the experiment (speech perception and speech production), results showed statistically significant improvements from pre-test to post-test.

Moore et al. (2005)
Study type: Control trial (pre-test, post-test and follow up).
Intervention: The effect of phonemic contrast discrimination training on the discrimination of whole words and on phonological awareness using a language training computer game compared to no intervention.
Sessions: Twelve 30 min sessions across 4 weeks.
Sample size: 30 (13M, 17F).
Participants: 8–10 year old normal students.
Outcome measures: Pre- and post-test of the word discrimination test (WDT) and the Phonological Assessment Battery (PhAB) for four subsets of alliteration, rhyme, spoonerism and non-word reading.
Outcomes: The Phonological Assessment Battery (PhAB) showed statistically significant improvement of performance post training and at follow up on all four subtests (alliteration, rhyme, spoonerism and non-word reading). The word discrimination test (WDT) showed significant improvement in the trained group following correction for multiple comparisons.

Eriksson et al. (2005)
Study type: Qualitative: semi-structured interviews.
Intervention: Therapists and the children were exposed to a training session provided through the VST, teaching them how to produce correct speech in different transparent layers.
Sessions: A single 1 h interview session.
Sample size: 17.
Participants: 9 children (7M, 2F) and 10 speech therapists; 8–15 year old children with hearing or speech impairments.
Outcome measures: Evaluating the usability of computer-based speech training. Description: the participants are asked about the current system, the strategies they use, and their requirements that should be embedded in a designed VST.
Outcomes: Key recommendations about user profiles, exercises, motivation, reward, visual models, cues and feedback.

Engwall et al. (2006)
Study type: Qualitative: semi-structured interviews.
Intervention: They try to determine if the learners are able to change their articulation according to feedback given by a virtual speech therapist.
Sessions: A 10 min test session and a 15 min interview session.
Sample size: 7.
Participants: 6 male children and 1 female adult: 3 × 9–14 years (extensive experience of speech therapy and CBSTs); 3 × 6 years (experience of speech therapy but not CBSTs); 1 adult second language learner. All children were recognized to have specific language impairment (SLI).
Outcome measures: Evaluating the usability of the ARTUR user interface. Description: the authors organized an interview session to assess functionality of the designed buttons, animations, feedback, engagement and adaptability of the ARTUR human–computer interface.
Outcomes: The participants were very positive about ARTUR. They made some recommendations for increasing the number of sounds supported by the program and some modifications in delivering articulation therapy. Some evidence was found that the learners could change their articulation according to the articulatory feedback instructions. The sub-experiment on one native speaker found that different phonemes can be discriminated in the articulatory space, using the acoustic signal as input.

Cole et al. (2007)
Study type: Single case design (pre-test, post-test and interview).
Intervention: Investigation of the effects of the Lee Silverman Voice Treatment (LSVT) virtual therapy system on SPL data.
Sessions: 7 VST treatment sessions and a single 1 h interview session.
Sample size: 1.
Participants: 4 adults (56–83) as testers to fill in the questionnaires. 1 female: 74 year old with dysarthria, 8 years post-diagnosis, no previous computer experience. Five individuals with Parkinson disease who had previously completed LSVT also participated in the project as consultants.
Outcome measures: Qualitative evaluation: evaluating the usability of the Marnie user interface through a questionnaire and an interview session with the person under therapy after the VST therapy session. Quantitative evaluation: pre- and post-test of sound pressure level in dB in three situations of sustained /ah/, monolog, and reading.
Outcomes: The testers were uniformly positive in rating the VST. The user was quite positive in the interview about the VST. The person under therapy improved mean score for measures of sustained /ah/, monolog and reading. The results were consistent with previously published LSVT efficacy data and therefore did not affect treatment adversely. In the interview, the user's voice was loud and clear at all times. The user described feeling a change in her voice and enjoyed the interaction with the CBST named Marnie.

Wren and Roulstone (2008)
Study type: Randomized control trial (pre- and post-test and follow up).
Intervention: Comparison between computer and tabletop delivery of phonology therapy.
Sessions: Weekly 30 min sessions across 8 weeks.
Sample size: 33 (25M, 7F).
Participants: 4–8 year old children; English as a first language; normal hearing; phonological impairment with or without phonetic disorder; no speech therapy within 3 months before the experiment.
Outcome measures: Pre- and post-test of GFTA sounds in words; percent of consonants correct (PCC); speech processing tasks; attention level rating scale; phoneme stimulability.
Outcomes: No statistically significant differences between computer, tabletop and no treatment groups were found. This was explained by the intention to keep the therapy content broadly similar for each child to allow for comparison, rather than the eclectic approach used in clinical settings.

Fagel and Madany (2008)
Study type: Cohort (pre-test, post-test and one week follow up).
Intervention: To investigate the use of a virtual talking head to improve the degree of children's lisping.
Sessions: 2 sessions.
Sample size: 8.
Participants: 4–7 year old children with lisping (interdentalisation of /s/ and /z/ sounds).
Outcome measures: Pre- and post-test of degree of lisp measured on a five point scale.
Outcomes: The Wilcoxon signed ranks test for non-parametric data showed six of the eight children had a significant improvement following the first lesson. Three of the eight had a persistent positive effect after the second learning lesson.

Wild (2009)
Study type: Randomized control trial.
Intervention: Randomized control trial investigating computer-aided instruction for practicing phonological awareness skills in beginning readers. The experimental group was compared to traditional intervention and no intervention.
Sessions: One 20 min session per week across 6 weeks.
Sample size: 127 (64M, 63F).
Participants: 5–6 year old normal children arranged in 3 groups: paper-based group (21M, 22F), computer group (22M, 22F), and a control group (21M, 19F).
Outcome measures: Pre- and post-test of phonological assessment battery (PhAB) and Marie Clay dictation test.
Outcomes: Pre- and post-test scores showed the computer-based group improved more than the other two groups. The Marie Clay dictation test did not show a statistically significant improvement.

Danubianu et al. (2009)
Study type: Control trial (pre-test and post-test).
Intervention: To determine if the use of an expert fuzzy logic system is as effective as a speech therapist.
Sessions: 24 meetings across 12 weeks.
Sample size: 40.
Participants: 5–6 year old children with dyslalia (difficulties pronouncing the R and S sounds).
Outcome measures: Pre- and post-test of the proportion of correct tries to distinguish sounds and words. The scores were generated by the VST. Mann–Whitney test used to assess the difference between groups; Wilcoxon for the difference between pre- and post-test.
Outcomes: Groups were parametrically and statistically equivalent. Both groups progressed in therapy. Both groups performed equivalently; therefore, exercise choice can be done as effectively by the expert system as by the speech therapist.

Saz et al. (2009)
Study type: Randomized control study.
Intervention: Students were exposed to different environments of a VST named "Comunica" ("Prelingua," "Vocaliza," and "Cuentame") to improve their phonological, articulatory and language skills respectively. Different games and different types of feedback were developed within each environment for different types of speech impairments.
Sessions: Not reported.
Sample size: 14 (7M, 7F).
Participants: 11–21 year old students with different levels of dysarthria. For normal speech an unimpaired corpus was used, consisting of 168 (73M, 98F) 10–18 year old speakers.
Outcome measures: Qualitative evaluation: interviews with therapists and students to evaluate the outcomes of using the system at school. Quantitative evaluation: word detection accuracy (WDA) was considered to evaluate the accuracy of the tool itself in three modes: speaker-independent, task-dependent and speaker-dependent.
Outcomes: All the different environments of the VST named "Comunica" were used at a special education school and showed very successful results, as the therapists and students confirmed. The therapists also evaluated positively the ease of use of the application. Results of the speaker-independent mode showed impaired speech deteriorates recognition by 29.94%. Task-independent results show an improvement up to 22.92% from the speaker-independent result. Speaker-dependent results show a dramatic improvement of 63.83 over the speaker-independent results. The authors concluded that the presented VST delivers performance similar to experts' agreement rate.

Thompson et al. (2010)
Study type: Control trial.
Intervention: A comparison between a VST named "Sentactics" and clinician-delivered Treatment of Underlying Forms (TUF) for generalized production and comprehension of complex sentences.
Sessions: Four 1 h sessions per week for at most 5 weeks.
Sample size: 12 (10M, 2F).
Participants: 35–68 year old agrammatic aphasic speakers; 6 receiving computer-based therapy and the other 6 as experimental control receiving no therapy.
Outcome measures: Pre- and post-test of the percent correct performance for both sentence comprehension and sentence production, measured by an SLP.
Outcomes: Resulting data from pre- and post-treatment tests showed significant improvements in the post group. A significant improvement was also seen between the post group and the control group. The significance of the difference was analyzed by the Wilcoxon signed ranks test.

Schipor et al. (2010)
Study type: Control trial (pre-test and post-test).
Intervention: To determine if the use of an expert fuzzy logic system is as effective as a speech therapist.
Sessions: 24 sessions across 12 weeks.
Sample size: 20.
Participants: 5–6 year old children with dyslalia (difficulties in pronouncing "R" and "S" sounds).
Outcome measures: Pre- and post-test of the correctness of pronunciation (generated by the VST), using nonparametric Mann–Whitney and Wilcoxon tests.
Outcomes: No significant differences in performance between the two groups, indicating that exercise choice by the expert system is possible.

Chaisanit et al. (2010)
Study type: Control trial (pre-test and post-test).
Intervention: Comparison of an interactive multimedia courseware with traditional therapy for the hearing impaired in vowel training.
Sessions: Not reported.
Sample size: 10.
Participants: 9th grade hearing impaired students divided into two groups: a control group (learning by traditional method) and an experimental group.
Outcome measures: Pre- and post-test of the correctness of vowel production. The t-test statistical method was used to analyze pre-test and post-test data.
Outcomes: There was a significant difference between both the traditional and the experimental group and the control group using the t-test. The courseware was more effective than using traditional methods alone.

Stacey et al. (2010)
Study type: Cohort (pre-test and post-test).
Intervention: To evaluate the effectiveness of a computer-based self-administered training package designed to improve speech perception among adults who have used cochlear implants for more than three years.
Sessions: 1 h training sessions 5 days per week across 3 weeks.
Sample size: 11 (6M, 5F).
Participants: 23–69 year old postlingually deafened adults.
Outcome measures: Percentage of word and sentence identification accuracy for various conditions: IEEE sentence test, BKB sentence test, consonant test, vowel test, GBI Questionnaire.
Outcomes: Mixed results: a statistically significant improvement was seen on the consonant test, and the IEEE test just failed to reach significance. No other improvements were seen.

George and Gnanayutham (2010)
Study type: Cohort.
Intervention: To investigate the best user interface (none, group or individual) for improving training sounds.
Sessions: 2 sessions.
Sample size: 10 (6M, 4F).
Participants: 4–36 year olds (7 normal and 3 people with a few unclear words).
Outcome measures: Average number of attempts to imitate the 12 training sounds.
Outcomes: Results indicated by t-test that a personalized interface had an advantage (better success rates with producing sounds) over no visual representation and the group interface.

Silva et al. (2012)
Study type: Cohort (pre-test and post-test).
Intervention: To verify the applicability of a software application in the rehabilitation of hearing-impaired children (hearing aid users and children with cochlear implants).
Sessions: 30 min sessions twice a week to accomplish the software strategies.
Sample size: 17.
Participants: 10 children with cochlear implants (mean age: 8 years and 4 months) and 7 children with hearing aids (mean age: 10 years and 4 months).
Outcome measures: Hearing in noise test (HINT) in quiet and in noisy conditions.
Outcomes: Auditory training using SARDA improved speech perception ability in quiet and noise for hearing-impaired children. Compared with the pre-test results, improvements were significant for both groups; however, the results of both groups were not different to each other.

Hawley et al. (2013)
Study type: Control trial.
Intervention: To develop a voice-input voice-output communication aid for people with severe speech impairments and assess the patients' improvement after using the device. The authors also tried to evaluate participants' desire to use the system.
Sessions: 1 h daily training session for 2–4 weeks.
Sample size: 9 (4M, 5F).
Participants: 30–80 year olds with moderate to severe dysarthria.
Outcome measures: Word recognition accuracy (WRA) and task completion performance.
Outcomes: Taking advantage of repeated practice with the system, the utterances of the person under therapy became more recognizable. Not all participants could complete all evaluation stages. The participants also agreed that the system would be more useful if it could produce more outputs.

Abad et al. (2013)
Study type: Control trial.
Intervention: An online VST is developed for people with aphasia to perform word naming training exercises to recover their word naming ability.
Sessions: 153 min for 8 × 113 exercises.
Sample size: 16 (10M, 6F).
Participants: 19–78 year olds with aphasia.
Outcome measures: Word naming score (WNS) and word verification rate (WVR).
Outcomes: The system could generate scores matching the SLP scores with an accuracy sufficient for a satisfactory recovery of the patients.

van Vuuren and Cherney (2014)
Study type: Randomized controlled cross-over study.
Intervention: The intervention is based on script treatment. The system has easy to use navigation facilities and a talking 3D model to read and carefully articulate each word. The person under therapy then reads the sentence, which is recorded by the system. There are assistance options for the user to replay each word, listen to his or her utterance, and watch the model's oral-motor movements.
Sessions: 2 training conditions, both for 3 weeks; 90 min per day in 3 sessions for 6 days per week.
Sample size: 8.
Participants: 25–66 year olds with mild-moderate to severe chronic aphasia.
Outcome measures: Average sentence level word accuracy.
Outcomes: The results show that, for persons with aphasia, using an ecologically valid real-world treatment strategy with more cues can lead to faster learning. The effect size for computer therapy was shown to be large and significant.


Table 2
Technological elements of VSTs introduced in shortlisted publications. Each entry lists the CBST name, CBST details and CBST technology.

Segers and Verhoeven (2004)
CBST name: Not named in paper.
CBST details: An education software program focusing on phonological awareness skills (rhyming and phoneme synthesis game). 10 different games, each with five exercises, increasing in difficulty.
CBST technology: Speech could be manipulated by the researchers, there were no distracting entertainment elements and there was 3.5 h of training material. Speech manipulation for group 2 used a pitch-synchronous overlap and add algorithm (PSOLA). The program provides feedback on each action performed. Help is offered when the child is incorrect on the second attempt (the correct answer is highlighted), so that a child cannot be 'stuck' in the game.

Massaro and Light (2004)
CBST name: Baldi.
CBST details: Computer animated talking head language tutor for speech perception and production. 8 categories (4 voiced vs. unvoiced, 3 consonant cluster distinctions, 1 fricative vs. affricate distinction).
CBST technology: The facial animation program controls a wireframe model which is texture mapped with a skin surface. Baldi can be aligned with synthetic or natural speech. Baldi can be made transparent so that the interior parts can be visible. Amplitude, pitch, rate and emotion are also expressed during speaking. Baldi has a tongue, hard palate, and 3D teeth, and his internal articulatory movements have been trained with electropalatography and ultrasound data from natural speech. Baldi can perform additional cues, e.g. voiced and voiceless segments. Speech can be slowed down. Baldi's articulation is aligned with auditory speech produced by the Festival text-to-speech synthesizer. The program was designed using the rapid application developer (RAD). A happy or sad face indicates responses. The voice recognition system in the CSLU toolkit was found to be inaccurate.

Moore et al. (2005)
CBST name: Phenomena.
CBST details: Adaptive language training computer game based on discrimination of phonemes representative of the major phonological categories of (British) English.
CBST technology: It runs from a CD on a Pentium II and above processors. Sound sets (n = 11) were constructed as continua of 96 sound files compressed into a single data file. 94 files between the endpoints were obtained by interpolating between the acoustic parameters of the endpoint files. Four tokens of the 22 syllables (11 sound sets) were obtained at 44.1 kHz and down-sampled to kHz. Sound files in each continuum were generated using linear prediction analysis and re-synthesis. Only spectral parameters (reflection coefficients) were manipulated. Spectral parameters were 15 reflection coefficients over a 6.7 ms window and included fundamental frequency and voicing. 30-min training blocks and graphics designed by a commercial game developer. A tutor (dinosaur character) mimes a syllable; two cavemen characters mime a syllable, one identical to the tutor, one a mistake. An adaptive staircase procedure was followed to vary the difficulty level of the matching task. Correct and incorrect responses were indicated by a bell or hooter sound immediately after the response. A cumulative score is indicated in the corner of the screen. Difficulty of the game level is indicated by an elevator gauge in the top corner. A brief animation follows each game. An additional arcade-style section was included as reward/motivation.

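The adaptive staircase mentioned above adjusts task difficulty from the listener's responses. A minimal sketch of one common variant (a 2-down/1-up rule, which is an assumption here since the study does not state the exact rule used) is given below.

```python
def staircase_update(level: int, correct: bool, streak: int,
                     n_down: int = 2, min_level: int = 0, max_level: int = 95):
    """One step of an n-down/1-up adaptive staircase.
    `level` indexes the stimulus along the 96-step continuum
    (higher = harder, i.e. closer to the confusable endpoint).
    Returns the new level and the updated correct-response streak."""
    if correct:
        streak += 1
        if streak >= n_down:          # make the task harder after n correct in a row
            level = min(max_level, level + 1)
            streak = 0
    else:                              # make the task easier after any error
        level = max(min_level, level - 1)
        streak = 0
    return level, streak

# Illustrative trial sequence: two correct answers raise difficulty, an error lowers it.
level, streak = 40, 0
for response in [True, True, False, True]:
    level, streak = staircase_update(level, response, streak)
    print(level, streak)
```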

Eriksson et al. (2005)
CBST name: ARTUR.
CBST details: Speech training aid with a virtual speech tutor, ARTUR, who can use 3D animations of the face and internal parts of the mouth to give feedback on the difference between the user's deviation and a correct pronunciation.
CBST technology: Recommendations. General: a speech therapist is irreplaceable and must be present in all stages of speech teaching. The system must allow the SLP to easily adapt and refine the exercises depending on the development and motivation of the child during supervised sessions. The system would be of most help if it supports repeated practice without an SLP. The system should support language learning. Material should relate to the daily life of users. User profiles: there should be a possibility to store the sounds being trained. The system should provide the possibility to store pictures and text together with the speech sounds. Exercises: all speech sounds must be present in the system. Motivation: to motivate children, it should contain game-like features. Both criticisms and rewards are necessary; negative feedback should never outbalance rewards. Rewards: rewards should be distributed according to the effort made by the child, not only the result. The teacher should be able to vary the amount of reward. Acoustic models: if pre-recorded models are used, it is important to have models of both sexes and a range of ages; if the child's own production is used as a model, it is important that recording and storing is easy. Visual models: the level of detail of visual models should be low and stylized. A 3D model for visual feedback is supported. In an animated model, some information can be excluded, but the palate and jaw must be present as a reference frame for the tongue; important features of the articulation should be highlighted in the model. Cues and feedback: the system should give the possibility to enhance cues with sign language, additional visual signals or written text.

Engwall et al. (2006)
CBST name: ARTUR.
CBST details: Speech training aid with a virtual speech tutor, ARTUR, who can use 3D animations of the face and internal parts of the mouth to give feedback on the difference between the user's deviation and a correct pronunciation.
CBST technology: The ARTUR tutor had pre-recorded natural speech (in French) and time-aligned synthesized movements. Sessions begin with an explanation of the training goal and the functionality of user controls. The interface had four info windows and one user control frame. The user can request an audio/visual animation of the target word at normal or slow speeds. For the first attempt at each word, a normal view of the face was shown. Subsequent attempts used the augmented reality display. Changing the background color prompts the user to speak. Corrective feedback differed between words. The system's automatic speech recognition (ASR) uses articulatory feature classification rather than the phoneme classification method, as the focus can be on particular features that were not produced by the speaker. An acoustic-to-articulatory inversion is used together with input parameters of the speaker's face.

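The articulatory-feature approach described above compares which features (place, manner, voicing) of the target were actually produced, rather than classifying whole phonemes. A minimal sketch of that comparison step follows, with a toy feature inventory that is our own illustration rather than ARTUR's classifier.

```python
# Toy articulatory feature inventory (illustrative, not ARTUR's feature set).
FEATURES = {
    "s": {"place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
    "th": {"place": "dental", "manner": "fricative", "voicing": "voiceless"},
    "t": {"place": "alveolar", "manner": "stop", "voicing": "voiceless"},
    "d": {"place": "alveolar", "manner": "stop", "voicing": "voiced"},
}

def feature_feedback(target: str, produced: str):
    """Return the articulatory features that differ between the target phoneme
    and the phoneme actually produced, so corrective feedback can focus on
    those features rather than on the whole phoneme."""
    diffs = {}
    for feat, want in FEATURES[target].items():
        got = FEATURES[produced][feat]
        if got != want:
            diffs[feat] = (got, want)
    return diffs

# Illustrative use: a lisped /s/ produced as a dental fricative.
print(feature_feedback("s", "th"))   # {'place': ('dental', 'alveolar')}
```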

Cole et al. (2007)
CBST name: LSVT virtual therapist system prototype (Marnie).
CBST details: Computer program of LSVT, an effective speech therapy that teaches individuals with Parkinson disease to think and speak more loudly.
CBST technology: Development of the LSVT system included these tasks: a. control of verbal and non-verbal behaviors of the virtual therapist in response to vocalizations produced by the person under therapy; b. human–computer interface for five LSVT exercises; c. accurate and intuitive feedback to the user about vocalizations; d. encouragement and reinforcement. The development phase involved specifying and codifying the rules that govern the clinician's response to the patient's vocalizations. As well as verbal and visual feedback, each utterance was accompanied by multimedia displays (e.g. a sustained 'ah' causes a car to travel forward). 'Marni' provides encouragement by smiling and nodding and saying phrases (e.g. 'go, go, go').

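LSVT exercises reward louder, sustained phonation, and Table 1 reports sound pressure level in dB as the quantitative outcome for this system. The sketch below shows one plausible way a VST could turn microphone frames into a relative dB level for such feedback; the calibration offset and target threshold are illustrative assumptions, not the Marnie implementation.

```python
import numpy as np

def frame_level_db(frame: np.ndarray, calibration_offset_db: float = 94.0) -> float:
    """Relative level of one audio frame in dB.
    RMS is converted to dB re full scale; adding a calibration offset
    (measured against a reference tone) would approximate dB SPL."""
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
    return 20.0 * np.log10(rms) + calibration_offset_db

def loudness_feedback(frame: np.ndarray, target_db: float = 75.0) -> str:
    """Map the measured level onto a simple 'louder / good' prompt,
    the kind of cue a sustained-'ah' exercise might display."""
    return "good" if frame_level_db(frame) >= target_db else "louder"

# Illustrative use on a synthetic 0.25 s frame of a quiet tone.
sr = 16000
t = np.arange(int(0.25 * sr)) / sr
frame = 0.05 * np.sin(2 * np.pi * 220 * t)
print(round(frame_level_db(frame), 1), loudness_feedback(frame))
```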
Wren and Roulstone (2008)
CBST name: Phoneme Factory.
CBST details: Phonological awareness activities: phoneme detection, rhyme awareness, phoneme blending, minimal pairs. Interactive games are set in a bubble factory. An adult female voice with a UK accent provides stimulus sounds and words.
CBST technology: Configuration of activities: 19 phonemes or targets; single sounds, non-words, real words and words in sentences; initial, medial and final word position; clusters and polysyllables; sound on or off; pictures on or off; adjustable length of time between presentation of phonemes in the phoneme blending activity; number of pictures/stimulus material to select in the blending activity.

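As an illustration of how such per-child activity configuration can be represented in software, the sketch below models the options listed above as a small data structure; the class and field names are hypothetical and not taken from Phoneme Factory.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActivityConfig:
    """Hypothetical activity configuration mirroring the options listed above:
    targets, stimulus type, word position, audio/picture toggles, blending timing."""
    target_phonemes: List[str] = field(default_factory=lambda: ["s", "k"])
    stimulus_type: str = "real words"      # single sounds, non-words, real words, words in sentences
    word_position: str = "initial"         # initial, medial or final
    include_clusters: bool = False
    sound_on: bool = True
    pictures_on: bool = True
    blending_gap_seconds: float = 1.0      # pause between phonemes in the blending activity
    pictures_per_blend: int = 3            # number of pictures to choose from

# Illustrative use: a medial-position /r/ task with a longer blending gap.
config = ActivityConfig(target_phonemes=["r"], word_position="medial", blending_gap_seconds=2.0)
print(config)
```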
Fagel and Madany (2008)
CBST name: Vivian.
CBST details: Developed for speech therapists to show and explain articulatory processes to people with phonetic disorders.
CBST technology: Written in HTML/JavaScript, the application is platform independent and runs from a CD-ROM without installation. It requires a standard web browser equipped with a 3D plug-in to display the VRML animations. User interface: the left column is dedicated to controlling the software; a table of speech sounds is located at the top of the control section. Sounds are connected to five lists of items, containing the sound in isolation, word initial position, word medial, word final, and word initial cluster. The therapist can change the display setting for skin (normal or transparent), detail (full face or zoom), orientation (frontal or side), and speed (normal or slow). The right side shows the animated head. A time slider allows speed control. A modular audio-visual speech synthesizer (MASSY) was used to generate the audio-visual speech utterances. Phonetic transcription is realized by embedding the external program txt2pho from the HADIFIX speech synthesizer. Audio synthesis generates the signal from the phonetic data and is rendered by the MBROLA speech synthesis engine. Visual articulation was generated with a simplified dominance model. The face module was created by Head Designer, a plugin for 3ds Max.

Wild (2009)
CBST name: Pack A CD-ROM activities from the 'Rhyme and Analogy program' published by Oxford University Press/Sherston Software.
CBST details: CD-ROM exercises based on the Story Rhyme Photocopy masters.
CBST technology: It supports speech-based feedback on the user's performance. Feedback confirms a correct response or corrects an incorrect response; in both instances, the target onset-rime pattern is repeated. The program also intervenes to prevent repeated failure on the part of the learner.


Danubianu et al. (2009)
CBST name: LOGOMON.
CBST details: An intelligent system for the individual therapy of speech disorders (especially dyslalia).
CBST technology: System objectives: evaluation of children to measure their progress, a database to assist the assessment methodology, development of an expert system to design the training schedule, development of the therapeutic guide to merge classical therapy methods with the audio-visual system, and a database of the child's personal progress in therapy. System architecture: functional blocks were identified as child, speech therapist, lab monitor program, 3D model, expert system and child monitor program. The monitor program allows records of the child's speech with instant audio feedback and a history of audio recordings. The home monitor is developed for PC and PDA. It runs exercises in a game manner, offers feedback and can perform statistics on subject scores. The home monitor transmits homework to the child's PC or PDA. It allows the speech therapist to collect textual and audio information about each child and can manage therapy. The 3D model provides a view of the articulators and mouth; it can be seen in transparent view. The expert system makes suggestions about training parameters.

Saz et al. (2009)
CBST name: Comunica, a combination of environments: PreLingua, an environment for computer-aided phonation acquisition; Vocaliza, an environment for articulation acquisition and speech training; Cuéntame, an environment to motivate students to use the language by means of different scenarios and appropriate feedback.
CBST details: An affordable strategy to help human speech therapists in the diagnosis and treatment of students with speech impairments. The system interacts with the user through dialogs to prepare training, collect the user's speech, and provide performance feedback to the human therapist.
CBST technology: PreLingua: signal preprocessing (DC offset and pre-emphasis); the energy of each frame is calculated (in voice activity games), and a threshold is applied to determine if voice is present in the frame (in intensity games). The signal is windowed, and a linear prediction coefficients analysis is applied to estimate the vocal tract transfer function. The autocorrelation of the prediction error leads to the extraction of the fundamental frequency range (tone games) and sonority level (breathing games), and extraction of the roots of the transfer function is used to extract the formant frequencies and values (in vocalization games). Vocaliza: the user interface requires speech input, and output (text, audio and images) appears automatically. Feedback is provided by the speech technologies, namely: automatic speech recognition (ASR), speech synthesis, speaker adaptation and pronunciation verification. Useful user profiles are generated for the therapist. Cuéntame: uses ASR, speech synthesis, speaker adaptation and PV technologies. It also benefits from the speech verification algorithm, morpho-syntactic analysis and a key-word spotter. Data corpus: dysarthric and non-impaired speakers were recorded using the Induced Phonological Register (Monfort and Juárez-Sánchez, 1989), a powerful set of words used in speech therapy containing examples of all phonemes and 70% of allophones. Every phoneme was labeled as deleted, mispronounced or correct.

Thompson et al. (2010) — Sentactics/Sabrina. CBST details: A virtual clinician (Sabrina) treats participants using the TUF object relative training protocol. CBST technology: None described.
Schipor et al. (2010) — Logomon. CBST details: Exercises for cheeks, lips and tongue; supervised inspiration and expiration from the temporal and intensity standpoints; onomatopoeic pronunciation; rhythmic pronunciation exercises; distinguishing between paronyms; pronunciation of direct, inverse and complex syllables, of words, or of paronyms, etc.; complex contexts – sentences, short stories, poems, riddles. CBST technology: It is an expert system based on a formalized therapy guide using a fuzzy logic paradigm with 230 rules and 22 linguistic variables. The fuzzy inference process involves fuzzification, rule evaluation, aggregation and defuzzification; e.g. a fuzzy rule: if a defect of speech is small and the child’s age is old and family involvement is advanced, then the number of weekly sessions is reduced. Functional blocks are: child, speech therapist, lab monitor program, expert system, 3D model, child monitor program.
Chaisanit et al. (2010) — Interactive Multimedia Courseware of Vowel Training for the Hearing Impaired. CBST details: Trains hearing impaired students in pronouncing vowels. CBST technology: Dynamic computer graphics were used to establish an animation display system which shows the relationship between the position of the nasal cavity, mouth, tongue, pharynx, velum and lip during speech sounds. An integrated
design process was used including four design
phases: requirements analysis (requirement’s
specification, features and components
identification and design goal setting),
conceptual design (design scenarios
development, information design, structure
design and page design), development
(low-fidelity prototyping, design walk-through
and high-fidelity prototyping) and formative
evaluation (expert review, one-to-one evaluation
and small-group evaluation).
Stacey et al. (2010) — Not named in paper. CBST details: A computer-based self-administered training package that was designed to improve speech perception among adults who had used cochlear implants for more than three years. CBST technology: Word training task: 400 key words were created, and three foils were created for each key word, forming quasi-minimal pairs. 20 speakers recorded 80 words each, giving 1600 words used for training. Sentence training task: 420 IEEE sentences and 180 low-predictability Speech in Noise sentences were used. Visual feedback on
accuracy was given by a green tick next to a
word in a sentence or red if not in the sentence.
The sentence was repeated and presented
acoustically and orthographically to
maximize lexical feedback. The material was
recorded by the same 20 speakers. The main
menu appeared when the computer was switched
on. Participants were able to select
word-training, sentence training, or quit. The
time spent completing tasks is displayed in the
left-hand corner and the percentage of words
correctly identified is displayed in the top
right-hand corner.
George and Gnanayutham (2010) — Not named in paper. CBST details: Assistive technology for children with phonological disorders. CBST technology: The interface has an introduction page, main page, exit page, preference page and help page. The main page has available sounds together with the front view
and 3D model animation and another animation to
help the user relate to the sound. The preference page
allows the user to choose sounds, background color,
animations and gender of sound. The help page
explains the preferences page. Applied theories
included: Low Fidelity Prototyping – eight golden
rules of interface, design, three principles of design;
Widgets – multiple intelligence, style guides to
widgets; Interaction design – split attention effects,
anchored instruction, GOMS model, TAGs,
Interaction Design, Accessibility, Recognizing
Diversity, Getting the user’s attention; Interface
Properties – psychological principles of interface
design, interactivity and graphic design, accessibility,
principles of interaction elements, icon properties,
non-anthropomorphic design; High Fidelity
Prototyping – eight Golden rules of interface design,
three principles of design, aids to multimedia learning;
Requirement Specification – Information-processing
theory. Additional features: easy navigation, a 3D
demonstration of sound with audio and a 2D animated
example of sound. The interface has both visual and
auditory output. Audio clips recorded using Pro Tools.
Jolly Phonics online resources were used for the 2D
animation. Maxon’s Cinema 4D was used for the 3D
interface. Macromedia Flash and Macromedia
Director were the authoring tools. The 3D models
showed well-defined parts inside the mouth used in
speech. Each animation was up to 3 s and showed the
firm contact and release of relevant organs of speech.
Semiotics, mental models and human–computer
interaction (HCI) principles were used in defining the
interactivity of the interface.
Silva et al. (2012) — SARDA. CBST details: The technology is based on “Fast ForWord” Language – used to decrease learning and language difficulties and can be used by teachers for stimulation of auditory abilities. Six strategies, each with three stages, four phases and three difficulty levels. CBST technology: Not discussed. SARDA project development was published in a paper in Portuguese (Balen et al., Relatorio Tecnico Parcial de Execucao do Projecto Software Auxiliar na Reabilitacao de Deficientes Auditivos, 2006).
Hawley et al. (2013) — Voice-Input Voice-Output Communication Aid (VIVOCA). CBST details: The device tries to help people with severe speech impairments (moderate to severe dysarthria) to improve their communication ability. It obtains the user’s disordered speech and generates synthetic speech with the same content. By repeated practice, not only does the device become more adapted to the user’s voice, the user’s utterances also improve. CBST technology: In the development of the system, a user-centric approach is followed. The process is triggered by obtaining the user’s disordered utterance through a microphone. The speech signal is then forwarded to a speech recognizer where the uttered words are recognized and passed to the message builder module. The message builder generates the text format of the final message and forwards it to the speech synthesizer for playing. The speech recognition module uses HMMs trained on the incoming speech. Therefore, the system can be adapted for an individual speaker using limited speech data, which
is a challenge for disordered speech recognition.
Abad et al. (2013) — Virtual therapist for aphasia treatment (VITHEA). CBST details: Word naming training for people with aphasia through a web-based environment. CBST technology: Acting as an online system for virtual aphasia treatment, VITHEA is designed as a web application system, basically to ask the person under therapy to
recall the content shown in a picture. The system
architecture is designed to support two application
modules, namely: patient application module and
clinician application module. Different heterogeneous
technologies have been integrated to develop the
architecture. The system is developed by open source
frameworks such as Apache Struts 2, Hibernate, and
Spring. The system uses a speech recognition engine,
text-to-speech synthesizer, and a virtual face
animation engine. Adobe® Flash® technology is used
to design an interactive user interface.
van Vuuren and Cherney (2014) — AphasiaRx. CBST details: A simple and user-friendly portable software system for people with fine motor skill difficulties. The main role is played by a 3D virtual talking head which is responsible for reading pre-provided scripts with accurate articulation. It can help the user to imitate the articulation and also the utterance in each session. CBST technology: The system is based on a portable architecture which can be used on mobile phones, tablets, and PCs. The 3D model is accurately tailored in terms of articulatory movements. For each script, natural voice is synchronously matched with the model face. Each sentence is supported by multi-colored labels matched in time with the model’s utterance. The person under therapy can control the session using several buttons to focus on any part of the script and the model’s articulation. The system can generate records and logs, which are very useful for SLPs to archive and
profile the users’ data.

Fig. 2. Publication trend of the shortlisted studies.

Gnanayutham, 2010; van Vuuren and Cherney, 2014), IEEEXplore (Chaisanit et al., 2010; Hawley et al., 2013), Taylor
& Francis (Engwall et al., 2006; Eriksson et al., 2005), Informa (Stacey et al., 2010; Wren and Roulstone, 2008), and
Wiley (Cole et al., 2007; Wild, 2009), and finally one from the ACM digital library (Fagel and Madany, 2008). One
study was selected from the Computing and Informatics Journal15 (Schipor et al., 2010) and one from International
Journal on Advances in Life Sciences16 (Danubianu et al., 2009).

15 http://www.cai.sk/ojs/index.php/cai.
16 http://unitedlifejournals.com/ijals/.

Fig. 3. The box plot for participant sample size in the 20 selected publications.

Details of the selected publications are presented in Table 1, ordered by publication date, including related references,
study types, description of interventions, participant sample size, characteristics of the participants, outcome measures,
and descriptions of the outcomes of the study. Table 2 also presents details about the twenty selected publications, but
in a technical way, focusing on the developed VST. It refers to the name of the developed system, its application details,
and its technological aspects. To generate both tables, one review author extracted data from the included studies, and
the second review author double-checked and updated the data, based on the source publications. The disagreements,
which were mostly about how to describe the interventions and outcomes in the tables, were resolved by discussion
with the co-authors.

5.1. Sample size

As shown by the box plot presented in Fig. 3, sample sizes ranged from n = 0 (Cole et al., 2007) to n = 127 (Wild,
2009). The median for the measure of sample size among the reviewed studies was 13 (SD = 26.992).

5.2. Participants

Fig. 4 presents a box plot for the participants’ age ranges in the selected studies. The median is calculated as 11
(SD = 21.69). In relation to gender, based on the available information from the selected studies, 173 males and 130
females participated in the experiments. Three papers did not present any information on the gender of the participants
(Danubianu et al., 2009; Fagel and Madany, 2008; van Vuuren and Cherney, 2014).

5.3. Type of studies

To generate this systematic review, we attempted to gather the most reliable studies on VST; however, we concluded that restricting the selection to randomized controlled trials, for instance, was not appropriate, so both qualitative and quantitative studies were included in our shortlisted publications. Qualitative
studies were based on user interviews as important client input during the analysis stage of the application development
(Huttunen et al., 2014). Of the selected papers, two studies were exclusively based on interview sessions (Engwall
et al., 2006; Eriksson et al., 2005) reporting improvements in the development of VSTs brought about by the results
of the interviews. Two studies conducted interview sessions to qualitatively analyze the usability of their program
(Cole et al., 2007; Saz et al., 2009), while also providing quantitative test results. In all four studies containing
interviews, the purpose of the interviews was to scrutinize the developed systems, not the user.
Eighteen papers presented quantitative studies. Of these, sixteen studies measured the effects of the VST therapy on
people with disorders, while two papers studied the efficiency of the VST in recognizing disordered speech (Hawley

Fig. 4. The box plot for participants’ age distribution in the 20 selected publications.

Fig. 5. Study types for the 20 selected publications (two publications contain both qualitative and quantitative studies).

et al., 2013) and evaluating the similarities between the VST results with the results generated by SLPs (Abad et al.,
2013). Controlled trial studies were reported in eight papers (Abad et al., 2013; Chaisanit et al., 2010; Danubianu et al., 2009; Hawley et al., 2013; Moore et al., 2005; Schipor et al., 2010; Segers and Verhoeven, 2004; Thompson et al., 2010), cohort studies in five papers (Fagel and Madany, 2008; George and Gnanayutham, 2010; Massaro and Light, 2004; Silva et al., 2012; Stacey et al., 2010), randomized controlled studies in four papers (Saz et al., 2009; van Vuuren and Cherney, 2014; Wild, 2009; Wren and Roulstone, 2008), and a single case design in one study (Cole et al., 2007). Fig. 5 shows the distribution of study types among the selected papers. The controlled trial was the most commonly used design, being reported in eight papers.

5.4. Disorders

Fig. 6 analyses the VSTs described in the selected studies based on the type of disorders they were developed to
target. As the diagram shows, hearing impairment was the disorder most frequently addressed, as discussed in five
publications (Chaisanit et al., 2010; Eriksson et al., 2005; Massaro and Light, 2004; Silva et al., 2012; Stacey et al.,
2010). Of these five papers, four referred to children and young students. This reveals the importance of timely and
persistent rehabilitation programs for hearing impairments in children which, if untreated for too long, can have a severe
effect on children’s language learning (Buran et al., 2014; May-Mederake, 2012). VSTs have a remarkable potential to
create an interesting and engaging therapy environment for these children. Other disorders that were addressed
were dysarthria in three studies (Cole et al., 2007; Hawley et al., 2013; Saz et al., 2009), aphasia in three studies (Abad
et al., 2013; Thompson et al., 2010; van Vuuren and Cherney, 2014), dyslalia in two studies (Danubianu et al., 2009;
Schipor et al., 2010), specific language impairment (SLI) in two studies (Engwall et al., 2006; Segers and Verhoeven,
2004), phonological impairments in one study (Wren and Roulstone, 2008), lisping in one study (Fagel and Madany,

Fig. 6. Frequency of different types of disorders addressed by the shortlisted studies.

Fig. 7. Frequency of different types of intervention addressed by the shortlisted publications.

2008), and finally, speech samples with a few unclear words, which were addressed in one study (George and Gnanayutham,
2010).

5.5. Intervention

Various samples of VSTs were introduced and adopted in the selected studies. All the studies introduced their own
developed tools and used them for intervention except one study, in which previously developed software was used
(Wild, 2009). Interventions varied based on the technical abilities of the VST used, the disorder targeted, and the
intervention framework the SLPs had selected. Fig. 7 presents a high-level analysis of the intervention categories for
which VSTs were developed. Nine publications focused on articulation (Chaisanit et al., 2010; Engwall et al., 2006;
Eriksson et al., 2005; Fagel and Madany, 2008; George and Gnanayutham, 2010; Hawley et al., 2013; Massaro and
Light, 2004; Schipor et al., 2010; van Vuuren and Cherney, 2014) and eight publications considered phonological
awareness intervention (Abad et al., 2013; Moore et al., 2005; Segers and Verhoeven, 2004; Silva et al., 2012; Stacey
et al., 2010; Thompson et al., 2010; Wild, 2009; Wren and Roulstone, 2008). One study dealt with the acoustic features
of pitch, vocal loudness, and duration, which was categorized in the phonation class (Cole et al., 2007). Two studies had the
features to deal with all three categories of articulation, phonological awareness, and phonation, which were classified
in the general class in the diagram (Danubianu et al., 2009; Saz et al., 2009).
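
The phonation-related features mentioned above (pitch, vocal loudness, and duration) can be estimated directly from the waveform. The following minimal sketch is illustrative only and is not taken from any of the reviewed systems; the frame length, hop size, energy floor, and F0 search range are assumptions chosen for readability rather than values used in the shortlisted studies.

```python
import numpy as np

def phonation_measures(signal, sr=16000, frame_len=0.03, hop=0.01,
                       energy_floor=1e-4, f0_range=(75, 400)):
    """Rough per-utterance pitch, loudness and duration estimates.

    A didactic sketch only: production systems (e.g. the phonation games
    described in Saz et al., 2009) use more robust voicing decisions.
    """
    n, h = int(frame_len * sr), int(hop * sr)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n, h)]
    energies, f0s = [], []
    for fr in frames:
        fr = fr - fr.mean()
        rms = np.sqrt(np.mean(fr ** 2))
        if rms < energy_floor:                 # treat frame as silence
            continue
        energies.append(rms)
        ac = np.correlate(fr, fr, mode="full")[n - 1:]   # autocorrelation
        lo, hi = int(sr / f0_range[1]), int(sr / f0_range[0])
        lag = lo + np.argmax(ac[lo:hi])                  # strongest period
        f0s.append(sr / lag)
    return {
        "active_duration_s": len(energies) * hop,        # voiced/active time
        "mean_rms": float(np.mean(energies)) if energies else 0.0,
        "median_f0_hz": float(np.median(f0s)) if f0s else 0.0,
    }
```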

5.5.1. Training stimuli


Based on the impairment, participants were trained and tested on different aspects of speech production and speech
comprehension skills. Of the selected studies, only six focused on small parts of speech for phonological awareness
and perception, such as phoneme synthesis, rhyming, syllabic awareness, word naming, etc. (Abad et al., 2013; Moore
et al., 2005; Segers and Verhoeven, 2004; Silva et al., 2012; Wren and Roulstone, 2008). In seven studies, sound and
words were considered for articulation assessment and training (Chaisanit et al., 2010; Engwall et al., 2006; Eriksson
et al., 2005; Fagel and Madany, 2008; George and Gnanayutham, 2010; Hawley et al., 2013; Schipor et al., 2010).

Fig. 8. Box plots for (left) total therapy time in hours and (right) length of therapy sessions in minutes.

In three studies, longer parts of speech such as phrases and sentences were taken into account (Stacey et al., 2010;
Thompson et al., 2010; van Vuuren and Cherney, 2014). The authors in Massaro and Light (2004) claimed that their
word-level VST can be applied for both speech perception and speech production training. In Cole et al. (2007), the
authors developed a VST for vowel pronunciation; however, the implemented virtual tutor used various sentences as
audio feedback on the user’s sound level and to make the session more engaging. One study contained all levels of
speech, from isolated sounds and vowels to sentences (Saz et al., 2009). This product was developed as a package of
different games for both phonological and articulation training.

5.5.2. Duration and frequency of therapy sessions


The studies were heterogeneous in how they reported the details of the conducted therapy sessions. Of the 20
selected studies, information on the therapy sessions was completely detailed in eleven studies, namely the number
of sessions, the length of each session, and details on how often the sessions were conducted (Engwall et al., 2006;
Eriksson et al., 2005; Hawley et al., 2013; Moore et al., 2005; Segers and Verhoeven, 2004; Silva et al., 2012; Stacey
et al., 2010; Thompson et al., 2010; van Vuuren and Cherney, 2014; Wild, 2009; Wren and Roulstone, 2008). Only
seven studies mentioned the number of sessions and/or the length of the sessions (Abad et al., 2013; Cole et al., 2007;
Danubianu et al., 2009; Fagel and Madany, 2008; George and Gnanayutham, 2010; Massaro and Light, 2004; Schipor
et al., 2010). There was no information on the therapy sessions in two of the studies (Chaisanit et al., 2010; Saz et al.,
2009).
The total therapy time varied from 25 min (a 10-min test session and a 15-min interview) as reported in
(Engwall et al., 2006), to 27 h reported in (van Vuuren and Cherney, 2014). Fig. 8 (left) shows the statistical behavior
for the length of the overall therapy in hours for twelve studies reporting this measure (see Table 1). The length of
each therapy session also varied from 10 min (Engwall et al., 2006) to 1 h (Hawley et al., 2013; Stacey et al., 2010;
Thompson et al., 2010). Thirty-minute sessions were popular, being reported in four studies (Moore et al., 2005; Silva et al.,
2012; van Vuuren and Cherney, 2014; Wren and Roulstone, 2008). The statistical behavior of the length of each therapy
session is presented in Fig. 8 (right). For the studies in which the therapy took more than one week, the frequency of
sessions varied from one session per week reported in three studies (Segers and Verhoeven, 2004; Wild, 2009; Wren
and Roulstone, 2008) to seven sessions per week reported in Hawley et al. (2013).

5.6. Outcome measures

According to the type of intervention, various outcome measures were extracted from the selected studies. Four
studies presented qualitative results (Cole et al., 2007; Engwall et al., 2006; Eriksson et al., 2005; Saz et al., 2009),
three of these studies using interview sessions (Engwall et al., 2006; Eriksson et al., 2005; Saz et al., 2009), while the

Fig. 9. Frequency of the technological building blocks in the selected studies.

other one used both an interview session and questionnaires (Cole et al., 2007). All of these four papers studied the
usability of VSTs.
Eighteen studies presented results based on quantitative measures. Of these 18 studies, in 16 studies
SLPs were in charge of generating the outcome measures (Chaisanit et al., 2010; Cole et al., 2007; Danubianu et al.,
2009; Fagel and Madany, 2008; George and Gnanayutham, 2010; Massaro and Light, 2004; Moore et al., 2005; Segers
and Verhoeven, 2004; Silva et al., 2012; Stacey et al., 2010; Thompson et al., 2010; van Vuuren and Cherney, 2014;
Wild, 2009; Wren and Roulstone, 2008). In one study, the same measures were generated by both the VST and the
SLP (Abad et al., 2013). The goal of this study was to analyze the similarity between the generated measures. One
study considered two measures: word recognition accuracy (WRA) generated by the VST, and task completion performance as determined by the SLP.
We extracted various quantitative measures considering both speech production and speech comprehension skills.
In relation to speech production, we extracted various measures such as the proportion of correct tries to pronounce
words as mentioned in three studies (Massaro and Light, 2004; Schipor et al., 2010; van Vuuren and Cherney, 2014),
sound pressure level (Cole et al., 2007), percentage of consonants correct (PCC) for the Goldman Fristoe Test of
Articulation (GFTA) (Wren and Roulstone, 2008), degree of lisping (Fagel and Madany, 2008), the percentage of
correct performance in sentence production (Thompson et al., 2010), correctness of vowel production (Chaisanit et al.,
2010), correctness of sound production (George and Gnanayutham, 2010), and task completion performance (Hawley
et al., 2013). The studies which considered speech comprehension focused on measures such as phonological awareness
task scores (Segers and Verhoeven, 2004), word discrimination test score and phonological assessment battery (PhAB)
(Moore et al., 2005; Wild, 2009), percentage of the word and sentence identification accuracy (Stacey et al., 2010),
hearing in noise test (HINT) score (Silva et al., 2012), word naming score (WNS), word recognition accuracy (WRA)
and word verification rate (WVR) (Abad et al., 2013).
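
As an illustration of one of the simpler production measures listed above, the percentage of consonants correct (PCC) is the share of attempted consonants that a rater judges to be produced correctly (Shriberg et al., 1997). The sketch below is a minimal restatement of that ratio, not the full GFTA scoring procedure; the example counts are invented.

```python
def percentage_consonants_correct(target_consonants, produced_correctly):
    """Percentage of consonants correct (PCC), after Shriberg et al. (1997).

    target_consonants: total consonants the speaker attempted in the sample.
    produced_correctly: how many of those were judged correct by the rater.
    """
    if target_consonants == 0:
        return 0.0
    return 100.0 * produced_correctly / target_consonants

# e.g. 67 of 90 attempted consonants judged correct -> PCC = 74.4
print(round(percentage_consonants_correct(90, 67), 1))
```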

5.7. VST technological building blocks

To design a successful VST, an appropriate therapy program should be simulated and delivered by the integration of
different algorithms and technologies. Here, we mention some of these technologies that were used in the introduced
VSTs in the selected studies. We categorized five major building blocks namely automatic speech recognition (ASR),
speech corpus, speech synthesizer, expert systems, and facial tracking systems. Fig. 9 shows the frequency of each of
these blocks in the shortlisted publications: speech synthesizers were the most frequently used technology, mentioned in ten studies, followed by ASR, reported in nine studies, and speech corpora, used in eight studies. Facial tracking and expert systems were each used in two studies. Fig. 10 shows the frequency of utilizing different technological building blocks among VSTs targeting different disorders. As shown in this figure, ASR, speech synthesis and speech
corpus were very popular in the studies to enrich VSTs.

5.7.1. Automatic speech recognition (ASR)


Automatic speech recognition (ASR) technologies are required when the VST needs to make decisions based on the
user’s received utterance. Fig. 11 presents the frequency of using ASR technology in the shortlisted papers according
to the type of intervention. Nine studies used ASR technology (Abad et al., 2013; Cole et al., 2007; Danubianu et al.,
2009; Engwall et al., 2006; Eriksson et al., 2005; Hawley et al., 2013; Massaro and Light, 2004; Saz et al., 2009;

Fig. 10. Frequency of using different technologies on the disorders.

Fig. 11. Frequency of using ASR in shortlisted publications in relation to the three intervention categories of articulation, phonation, and phonological
awareness.

Schipor et al., 2010). Of these nine studies, five used ASR in articulation intervention (Engwall et al., 2006; Eriksson
et al., 2005; Hawley et al., 2013; Massaro and Light, 2004; Schipor et al., 2010), one in phonological awareness (Abad
et al., 2013) and one in phonation therapy (Cole et al., 2007).
ASR has different technological layers, and an expert in speech processing is required to contribute if a VST needs
to have this capability. There are many toolkits and frameworks that deliver ASR services which are available as
open source tools or commercial products (Duarte et al., 2014). Appropriate ASR tools with suitable feature selection
strategies and an acceptable level of efficiency should be selected (Rong et al., 2009), as the authors in (Massaro and
Light, 2004) were not satisfied with the accuracy of their selected ASR toolkit. These challenges can lead to ASR-free VST architectures, in which input speech is instead recorded for later reference, as stated in four
studies (Fagel and Madany, 2008; Massaro and Light, 2004; Thompson et al., 2010; van Vuuren and Cherney, 2014).
Three studies preferred to conduct human rating sessions (Fagel and Madany, 2008; Thompson et al., 2010; Wren and
Roulstone, 2008) and seven studies developed special user interfaces through which a mouse and keyboard were used
as the input medium17 (George and Gnanayutham, 2010; Moore et al., 2005; Segers and Verhoeven, 2004; Stacey
et al., 2010; Thompson et al., 2010; Wild, 2009; Wren and Roulstone, 2008). An appropriate ASR technology could
have been used to remove such inconsistencies, leading to more accurate results.
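
As a hedged illustration of how a VST might use a recognizer for word-level pronunciation verification, the sketch below assumes a hypothetical transcribe() wrapper around whichever open-source or commercial ASR toolkit is integrated; the string-similarity check and acceptance threshold are placeholders, not techniques reported by any of the reviewed studies.

```python
from difflib import SequenceMatcher

def transcribe(audio_path):
    """Placeholder for an ASR call; in practice this would wrap whichever
    recognizer the VST integrates and return the best hypothesis string."""
    raise NotImplementedError

def check_attempt(audio_path, target_word, accept_ratio=0.8):
    """Very coarse pronunciation verification: compare the recognised string
    with the target word and accept the attempt if they are similar enough."""
    hypothesis = transcribe(audio_path).strip().lower()
    similarity = SequenceMatcher(None, hypothesis, target_word.lower()).ratio()
    return {"hypothesis": hypothesis,
            "similarity": round(similarity, 2),
            "accepted": similarity >= accept_ratio}
```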

5.7.2. Facial feature tracking


Two studies reported the use of facial feature tracking technology (Danubianu et al., 2009; Engwall et al., 2006).
Tracking the facial features of the person under therapy can provide very good feedback for the virtual speech therapist
to detect the user’s emotional state and articulation deficits. This information helps the VST to select the next operational
phase, such as delivering a new exercise or providing new feedback. To accomplish this goal, the face of the

17 Note that some intervention strategies generally do not need the voice of the person under therapy in either table-top or computer-based forms.

Fig. 12. Frequency of using speech synthesizers in the shortlisted publications in relation to the three intervention categories of articulation,
phonation, and phonological awareness.

person under therapy is constantly tracked by cameras, and the emotional and clinical models are built by the program
to infer the current status of the user (Chuang et al., 2014; Tjondronegoro et al., 2008). Facial feature tracking can
be merged with speech features to detect speech impairments, since the correlation between jaw and lip position, for
instance, can be detected for the corresponding speech acoustics (Danubianu et al., 2009; Engwall et al., 2006).
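
Neither of the two studies details its tracking implementation, so the following sketch only illustrates the general idea using OpenCV's bundled Haar cascades: detect the face and take the lower third of the face box as a rough mouth region whose movement could then be correlated with the speech signal. The region heuristic and parameters are assumptions for the sketch.

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_region(frame):
    """Return a rough mouth bounding box (lower third of the first detected
    face), e.g. for correlating lip/jaw movement with the acoustics."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return (x, y + 2 * h // 3, w, h // 3)   # (x, y, width, height)

cap = cv2.VideoCapture(0)                   # webcam facing the user
ok, frame = cap.read()
if ok:
    print(mouth_region(frame))
cap.release()
```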

5.7.3. Speech synthesizer


Ten studies used speech synthesis in their VST design (Abad et al., 2013; Cole et al., 2007; Engwall et al., 2006;
Eriksson et al., 2005; Fagel and Madany, 2008; Hawley et al., 2013; Massaro and Light, 2004; Moore et al., 2005;
Thompson et al., 2010; van Vuuren and Cherney, 2014). Of these ten studies, six studies used speech synthesis for
articulation intervention (Engwall et al., 2006; Eriksson et al., 2005; Fagel and Madany, 2008; Hawley et al., 2013;
Massaro and Light, 2004; van Vuuren and Cherney, 2014), three studies focused on phonological awareness (Abad
et al., 2013; Moore et al., 2005; Thompson et al., 2010) and one study focused on phonation (Cole et al., 2007).
Speech synthesis is an important part of VSTs in the sense of human–computer interaction, since audio commands
and feedback are automatically generated through speech synthesizers. The authors in Massaro and Light (2004)
used this technology for their 3D head’s articulation by adopting Festival text-to-speech synthesizer. In Fagel and
Madany (2008), a modular audio-visual speech synthesizer (MASSY) was used to generate the audio-visual speech
utterances. Phonetic transcription was realized by embedding the external program txt2pho from the HADIFIX speech
synthesizer. Audio synthesis was used to generate the signal from the phonetic data, which was rendered by the MBROLA
speech synthesis engine. In this study, visual articulation was also generated with a simplified dominance model.
Fig. 12 shows the frequency of using speech synthesizers in the shortlisted publications, showing the importance of
this technology in articulation therapy.
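
The reviewed systems relied on engines such as Festival and MBROLA; the sketch below uses the pyttsx3 package purely as a readily available stand-in to show how spoken prompts and corrective feedback might be scripted around a synthesizer. The prompt wording and speaking rate are assumptions, not taken from any reviewed VST.

```python
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)      # slower speech suits therapy prompts

def speak_feedback(target_word, correct):
    """Generate a short spoken prompt or correction for the user."""
    if correct:
        engine.say(f"Well done. You said {target_word} correctly.")
    else:
        engine.say(f"Let's try again. Listen carefully: {target_word}.")
        engine.say(target_word)
    engine.runAndWait()

speak_feedback("rabbit", correct=False)
```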

5.7.4. Expert systems


Two studies reported the use of expert systems in their VSTs (Schipor et al., 2010; Ward et al., 2011). Expert system
strategies may be used in different parts of VSTs to achieve a variety of goals. An expert system algorithm tries to
emulate human decision-making by reasoning through a knowledge base in a particular subject area (Tyler, 2007).
In Schipor et al. (2010), the authors designed an expert system to personalize the process of therapy, assist therapists
with exercise selection, and to accomplish a self-correcting strategy for the system’s knowledge base when differences
between system’s decisions and therapist’s decisions were revealed. They also used triangular fuzzy sets to create an
efficient model of the speech therapist’s decisions. This system was found to be equally as effective when measuring
pronunciation improvement as when a speech therapist made the exercise selection. However, in Eriksson et al. (2005),
the authors concluded that the speech therapist should be involved at all stages of therapy and be able to select exercises
based on patient’s progress and motivation. Another study described the development of an expert system that involved
specifying and codifying the rules that govern a clinician’s response to the user’s vocalizations (Cole et al., 2007).
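
The full rule base of Schipor et al. (2010) is not reproduced in the reviewed paper, so the sketch below encodes only the single example rule quoted above, with invented triangular membership functions and input scales. It illustrates the fuzzification and fuzzy-AND (minimum) steps rather than the complete rule aggregation and defuzzification pipeline.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def weekly_sessions(defect_severity, age_years, family_involvement):
    """Evaluate the example rule: IF defect is small AND age is old AND
    family involvement is advanced THEN weekly sessions are reduced.

    Inputs are on illustrative scales (severity and involvement in [0, 10],
    age in years); all breakpoints are invented for this sketch.
    """
    small_defect = tri(defect_severity, -1, 0, 4)
    old_age = tri(age_years, 8, 12, 18)
    advanced_family = tri(family_involvement, 6, 10, 11)
    strength = min(small_defect, old_age, advanced_family)   # fuzzy AND
    baseline, reduced = 4, 1                                  # sessions/week
    return round(baseline - strength * (baseline - reduced))

print(weekly_sessions(defect_severity=2, age_years=11, family_involvement=9))
```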

5.7.5. Speech corpus


Speech has a variable and nonlinear nature which makes speech recognition a challenging task, particularly when
the system deals with disordered speech. All the reviewed studies that took advantage of ASR technology used learning
algorithms as well to create a speech model upon which the user utterance is processed. The training data for these
learning strategies are structured in a speech corpus containing both normal and disordered speech. Of the selected

Fig. 13. (left) The number of studies using 3D intervention technology for different disorders. (right) The number of studies using games for different
disorders.

publications, all the studies that used ASR technology reported developing their own speech corpora except one, in
which the type of intervention did not need a speech corpus since the algorithm was not based on a machine learning
strategy (Cole et al., 2007).
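
A speech corpus of the kind described for Comunica, in which every target phoneme is labeled as deleted, mispronounced or correct, can be represented with a simple manifest structure. The sketch below is one possible layout, not the format used by any reviewed study; the speaker, word and labels in the example are invented.

```python
from dataclasses import dataclass, field
from typing import List

LABELS = {"correct", "mispronounced", "deleted"}

@dataclass
class PhonemeAnnotation:
    phoneme: str          # target phoneme, e.g. "rr"
    label: str            # one of LABELS

@dataclass
class CorpusItem:
    speaker_id: str
    impaired: bool        # disordered vs. typical speaker
    word: str             # prompted word
    audio_path: str
    phonemes: List[PhonemeAnnotation] = field(default_factory=list)

    def error_rate(self):
        """Share of target phonemes not produced correctly."""
        if not self.phonemes:
            return 0.0
        wrong = sum(p.label != "correct" for p in self.phonemes)
        return wrong / len(self.phonemes)

item = CorpusItem("child_07", True, "perro", "corpus/child_07/perro.wav",
                  [PhonemeAnnotation("p", "correct"),
                   PhonemeAnnotation("e", "correct"),
                   PhonemeAnnotation("rr", "mispronounced"),
                   PhonemeAnnotation("o", "correct")])
print(round(item.error_rate(), 2))   # -> 0.25
```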

5.7.6. Other technologies


There were other technologies used to complete VSTs in specific cases. Electropalatography is one such technology, used in Massaro and Light (2004) along with ultrasound data from natural speech, in order to virtualize internal
articulatory movements.

5.8. Therapy delivery approaches

The therapy delivery approach is the human–computer interaction aspect of the VST. Various technologies and ideas
were adopted to deliver the required information to the user. The two most popular technologies used in the reported
VSTs were 3D virtual heads or animated heads and game-like user interfaces.

5.8.1. 3D virtual heads


Nine studies used animated heads as a therapy delivery method (Abad et al., 2013; Chaisanit et al., 2010; Engwall
et al., 2006; Eriksson et al., 2005; Fagel and Madany, 2008; George and Gnanayutham, 2010; Schipor et al., 2010;
Thompson et al., 2010; van Vuuren and Cherney, 2014). For articulation therapy, virtual talking 3D heads were the most popular intervention technology: a real therapy session was simulated and virtualized, with the virtual talking 3D head acting as an animated guide that delivers instructions and feedback. These animated guides show how to use the articulators, explain how to progress through the application, provide useful audio-visual feedback, and make the experience more personal, as the user is guided by a character rather than the computer (Abad et al.,
2013; Cole et al., 2007; Thompson et al., 2010; van Vuuren and Cherney, 2014). In six designed VSTs, 3D heads were
implemented in a multilayer way so the user can switch to transparent skin mode or sagittal view to see inside the
articulators and the exact features that they may need to correct (Engwall et al., 2006; Eriksson et al., 2005; Fagel and
Madany, 2008; George and Gnanayutham, 2010; Massaro and Light, 2004; Schipor et al., 2010). One study provided
transparent skin mode with 2D animations within the user interface (Chaisanit et al., 2010). As shown in Fig. 13 (left),
judging from the selected papers, this kind of intervention seems to be popular for people with hearing impairment
and aphasia.

5.8.2. Computer games


In relation to different disorders, six studies used games to deliver the therapy. Game-like features were one of the
design recommendations put forward in a qualitative study with users of the ARTUR system (Eriksson et al., 2005).
This is a popular and useful type of intervention to indirectly deliver exercises when the target population is children.
Games, if well developed, make the therapy session more engaging and interesting. In the reviewed studies, games are
used to improve phonological awareness in four studies (Moore et al., 2005; Segers and Verhoeven, 2004; Silva et al.,
2012; Wren and Roulstone, 2008), vowel training in one study (Chaisanit et al., 2010), and articulation skills in one
study (Saz et al., 2009). The authors in Moore et al. (2005) adopted an arcade-style reward selection to motivate users

to follow the training stage. They also concluded that the difficulty of levels in game-like training was an important
part of therapy. Fig. 13 (right) shows the disorders targeted by computer games. As shown in the figure, similar to the
3D heads, the category of hearing impairments was the disorder most frequently addressed using games as a therapy
environment.

5.9. Support features

5.9.1. Cues
Whilst visual and audio cues given by 3D virtual heads were the most commonly used cues, a number of alternative
and additional cues were described in computer-based therapy interventions. In a study on users of cochlear implants,
visual feedback on accuracy was given by a green tick next to a word in a sentence, or red if it was not in the sentence.
The sentence was repeated and presented acoustically and orthographically to maximize lexical feedback (Stacey et al.,
2010). The authors in (Engwall et al., 2006) described the use of a screen color change and an optional written word
as extra prompts. An extra animation to assist the user to connect with a sound was also described in (George and
Gnanayutham, 2010). Similarly, in another intervention, each utterance was accompanied by multimedia displays (e.g.
sustained /ah/ causes a car to travel forward) (Cole et al., 2007). Baldi, the 3D head in (Massaro and Light, 2004), could provide additional cues – e.g. for voiced and voiceless segments of target sounds.

5.9.2. Feedback
It was emphasized that negative feedback should not outweigh positive feedback (Eriksson et al., 2005). Several
programs ensured that the child/user did not repeatedly fail an exercise and intervened to allow the user to progress
(Segers and Verhoeven, 2004). One study concluded that the amount and detail of feedback should adapt online to
performance and should only be given on a particular sound (other sound errors are logged for subsequent sessions).
However, the ten feedback options used in this CBST were too crude and missed smaller errors.
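
The feedback policy described above can be made concrete with a small sketch: feedback is restricted to the current target sound, errors on other sounds are only logged for later sessions, and the level of detail adapts to recent accuracy. The class below is illustrative only; the window size, accuracy threshold and feedback categories are assumptions, not taken from the unnamed CBST.

```python
from collections import deque

class FeedbackPolicy:
    """Adapt feedback detail to recent performance on the current target
    sound; errors on other sounds are only logged for later sessions."""

    def __init__(self, target_sound, window=5):
        self.target_sound = target_sound
        self.recent = deque(maxlen=window)   # True/False per attempt
        self.logged_errors = []

    def respond(self, attempted_sound, correct):
        if attempted_sound != self.target_sound:
            if not correct:
                self.logged_errors.append(attempted_sound)
            return None                      # no feedback on other sounds
        self.recent.append(correct)
        accuracy = sum(self.recent) / len(self.recent)
        if correct:
            return "praise"
        # struggling users get more detailed, supportive feedback
        return "detailed_cue" if accuracy < 0.6 else "brief_correction"
```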

5.9.3. Personalized interface


An interface that can be personalized for color, sound and animation was found to be advantageous (George and
Gnanayutham, 2010). The storage of sounds, pictures and texts for individual users was also recommended in the
interview research by (Eriksson et al., 2005).

6. Discussions

We encountered a great deal of heterogeneity when generating this systematic review. Focusing on our research
questions, we attempted to manage the extracted information to draw an organized and structured conclusion. Here
we discuss the limitations of this systematic review and present our findings according to our research questions. The
quality of some of the studies was low, sample sizes were small, and procedures were not outlined in sufficient
detail. Interventions, outcome measures and type of computer-based intervention varied remarkably; therefore, general
conclusions are difficult to draw. In the following, we discuss the extracted information in a way that answers our
research questions.

6.1. Disorders addressed by the VSTs

Within the scope of the types of the disorders addressed by VSTs, we categorized various disorders and as mentioned
in the result section (Section 5.4), hearing impairments were the most frequent disorder addressed by the shortlisted
studies, with five studies dedicated to them. Combining these five studies with the results extracted on the technological
building blocks showed that three of these five studies focused on articulation intervention (Chaisanit et al., 2010;
Eriksson et al., 2005; Massaro and Light, 2004); two used ASR, speech corpus and speech synthesis technologies
together with animated heads and games. The sample size of these five studies was less than 20 (median = 11, SD = 4.45).
Among them, three studies reported a significant improvement from VST training. All five studies mentioned using small
parts of speech for the training stimuli; however, one of them considered both word and sentence level materials (Stacey
et al., 2010). It is also important to mention that four of these five studies focused on children for therapy. Based on these
data, we can state that VSTs developed for people with hearing impairment focus mostly on articulation, analysing

small parts of speech, using technologies like ASR, speech corpus and speech synthesis. They use game-like features
and animated heads to deliver the therapy.
Dysarthria and aphasia were the second most frequently addressed disorders, each being considered in three studies.
In relation to dysarthria, one study focused on phonation (Cole et al., 2007), one study on articulation (Hawley et al.,
2013), and one study considered both articulation and phonology (Saz et al., 2009). All three studies used ASR and
speech corpus and two used speech synthesis. In (Cole et al., 2007), an animated head was used to deliver the therapy
while in (Saz et al., 2009) games were used. Sample sizes were small (median = 9, SD = 6.55) and they used small
parts of speech (sounds and words) in their studies. In the case of aphasia, two of the three studies were based on
phonological awareness intervention (Abad et al., 2013; Thompson et al., 2010) and one was on
articulation (van Vuuren and Cherney, 2014). One of the three studies used small parts of speech (Abad et al., 2013),
while the other two used longer parts of speech in their studies. Sample sizes were not large (median = 12, SD = 4). All
three studies used speech synthesis. Only one used ASR and speech corpus (Abad et al., 2013), while the other two
used animated heads to deliver the therapy. Based on these data, we can state that VSTs developed for people with
dysarthria have ASR and speech corpus in their technological building blocks; however, in the case of aphasia, speech
synthesizers and animated heads are more common. Five of these six studies on dysarthria and aphasia
reported significant differences.
In relation to other disorders, the results are heterogeneous, but a common characteristic is the small number of participants recruited for the experiments.

6.2. Effectiveness of VSTs in therapy

In relation to experimental assessment, although almost all the studies agreed on the efficiency of VSTs in speech
therapy (compared with no-therapy groups using pre- and post-tests), only six studies compared the performance
of VSTs with traditional SLPs in speech-language therapy (Abad et al., 2013; Chaisanit et al., 2010; Danubianu et al.,
2009; Schipor et al., 2010; Thompson et al., 2010; Wild, 2009; Wren and Roulstone, 2008). The authors in (Abad
et al., 2013) compared the word naming score (WNS) generated by their VST with the SLP scores, finding that there
was no significant difference between them. The authors in (Schipor et al., 2010) compared their VST with SLPs in
terms of training session management, that is, determining the number, length and content of therapy sessions. They
came to the conclusion that there was no significant difference between the group managed by the VST and the group
managed by the SLP. Two studies showed significant differences in the improvement of the participants as a result of
their designed VSTs (Chaisanit et al., 2010; Wild, 2009); however, two other studies reported no significant difference
in the performance of their VSTs compared with traditional therapy (Thompson et al., 2010; Wren and Roulstone,
2008).

6.3. Technological elements of VSTs

Based on our extracted technical data, the speech synthesizer, speech corpus and ASR are the most frequently
used technologies in the reported VSTs. ASR was used to analyze the incoming speech, while speech synthesis was
used to generate natural speech. The popularity of these two technologies reveals the importance of human–computer
interaction in the area of virtual speech therapist systems. Other technologies, such as face tracking and expert system
modules, were also used.

7. Conclusion

This systematic review was written based on the PRISMA Statement to study papers published on virtual speech
therapy environments developed for speech disorders. Data extraction was undertaken based on the following three
research questions: What types of articulation and phonological disorders do VSTs address? How effective are virtual
speech therapists? and What technological elements have been utilized in VST projects? The studies were very het-
erogeneous in terms of targeted disorders, sample size, outcome measures, assessment strategies, and technological
aspects. The studies were not even homogeneous in the way they described and assessed their VSTs. Despite the
inconsistencies, as we adopted an organized and structured way to review the studies, it is possible to reach some
useful conclusions to answer the research questions and help future work on VST. A study of the disorders that were

examined in the shortlisted papers (research question 1) showed that hearing impairments, aphasia, and dysarthria were
the most frequently studied disorders, being addressed in five, three, and three studies, respectively. Lisping, SLI and
phonological impairments were the other disorders examined in the studies. Hearing impairments were the most com-
mon disorder to be addressed by sophisticated VSTs involving complicated human–computer interaction technologies,
such as 3D virtual heads and games. There were many recommendations for customizing and individualizing VSTs,
according to the patients’ clinical condition. In relation to effectiveness (research question 2), all the studies agreed on
the effectiveness of the VSTs and there were no recommendations to not use VSTs; however, there was no consensus
on the superiority of VSTs over human speech pathologists. In relation to the technical issues (research question 3),
adopting ASR, speech synthesis and speech corpus was very popular in the studies to enrich VSTs. Based on the pub-
lications we reviewed, the technical challenges of integrating cutting-edge technologies to develop a unified, functional VST dwarf the clinical challenges. This complexity affected many studies, in which the authors had to introduce human intervention into their computer-based therapy because of the lack of speech recognition systems or speech synthesizers
embedded within their VSTs. Fortunately, these complex technologies are becoming more user-friendly and reusable,
making it possible to expect more standard and efficient VSTs in the future.

References

Abad, A., Pompili, A., Costa, A., Trancoso, I., Fonseca, J., Leal, G., Farrajota, L., Martins, I.P., 2013. Automatic word naming recognition for an
on-line aphasia treatment system. Comput. Speech Lang. 27, 1235–1248.
Beilby, J.M., Byrnes, M.L., Yaruss, J.S., 2012. Acceptance and Commitment Therapy for adults who stutter: psychosocial adjustment and speech
fluency. J. Fluen. Disord. 37, 289–299.
Bench, J., Kowal, Å., Bamford, J., 1979. The BKB (Bamford–Kowal–Bench) sentence lists for partially-hearing children. Br. J. Audiol. 13, 108–112.
Berg, B.L., Lune, H., 2011. Qualitative Research Methods for the Social Sciences, 8th ed. Pearson, UK.
Beskow, J., Engwall, O., Granström, B., Nordqvist, P., Wik, P., 2008. Visualization of speech and audio for hearing impaired persons. Technol.
Disabil. 20, 97–107.
Bhattacharyya, N., 2014. The prevalence of voice problems among adults in the United States. Laryngoscope.
Buran, B.N., Sarro, E.C., Manno, F.A.M., Kang, R., Caras, M.L., Sanes, D.H., 2014. A sensitive period for the impact of hearing loss on auditory
perception. J. Neurosci. 34, 2276–2284.
Chaisanit, S., Suksakulchai, S., Nimnual, R., 2010. Interactive multimedia courseware of vowel training for the hearing impaired. In: 2010
International Conference on Control Automation and Systems (ICCAS), pp. 1196–1199.
Chien, C.-W., Rodger, S., Copley, J., Skorka, K., 2014. Comparative content review of children’s participation measures using the international
classification of functioning, disability and health – children and youth. Arch. Phys. Med. Rehabil. 95, 141–152.
Chuang, C.-H., Cheng, S.-C., Chang, C.-C., Chen, Y.-P.P., 2014. Model-based approach to spatial–temporal sampling of video clips for video object
detection by classification. J. Visual Commun. Image Represent. 25, 1018–1030.
Cole, R., Halpern, A., Ramig, L., Vuuren, S.V., Ngampatipatpong, N., Yan, J., 2007. A virtual speech therapist for individuals with Parkinson's disease. Educ. Technol. 47, 51–55.
Danubianu, M., Pentiuc, S.-G., Schipor, O.A., Nestor, M., Ungureanu, I., Schipor, D.M., 2009. TERAPERS – intelligent solution for personalized
therapy of speech disorders. Int. J. Adv. Life Sci. 1, 26–35.
Duarte, T., Prikladnicki, R., Calefato, F., Lanubile, F., 2014. Speech recognition for voice-based machine translation. Software, IEEE 31, 26–31.
Engwall, O., Bälter, O., Öster, A.-M., Kjellström, H., 2006. Designing the user interface of the computer-based speech training system ARTUR
based on early user tests. Behav. Inform. Technol. 25, 353–365.
Eriksson, E., Bälter, O., Engwall, O., Öster, A.-M., Kjellström, H.S., 2005. Design recommendations for a computer-based speech training system
based on end user interviews. In: Proceedings of the Tenth International Conference on Speech and Computers, pp. 483–486.
Fagel, S., Madany, K., 2008. A 3-D virtual head as a tool for speech therapy for children. Interspeech.
Frederickson, N., Frith, U., Reason, R., 1997. Phonological Assessment Battery (Manual and Test Materials). NFER-Nelson, Windsor.
Furberg, C.D., Friedman, L.M., 2012. Approaches to data analyses of clinical trials. Prog. Cardiovasc. Dis. 54, 330–334.
George, J., Gnanayutham, P., 2010. Developing multimedia interfaces for speech therapy. Univ. Access Inf. Soc. 9, 153–167.
Gillon, G.T., 2004. Phonological Awareness: From Research to Practice. Guilford Press, USA.
Goldman, R., Fristoe, M., 2000. Test of Articulation. American Guidance Services, USA.
Golonka, E.M., Bowles, A.R., Frank, V.M., Richardson, D.L., Freynik, S., 2012. Technologies for foreign language learning: a review of technology
types and their effectiveness. Comput. Assist. Lang. Learn. 27, 70–105.
Hawley, M.S., Cunningham, S.P., Green, P.D., Enderby, P., Palmer, R., Sehgal, S., O’Neill, P., 2013. A voice-input voice-output communication aid
for people with severe speech impairment. IEEE Trans. Neural Syst. Rehabil. Eng. 21, 23–31.
Henshaw, H., Ferguson, M.A., 2013. Efficacy of individual computer-based auditory training for people with hearing loss: a systematic review of
the evidence. PLOS ONE 8, e62836.
Huttunen, S., Manninen, K., Leskinen, P., 2014. Combining biogas LCA reviews with stakeholder interviews to analyse life cycle impacts at a
practical level. J. Clean. Prod. 80, 5–16.

Kagohara, D.M., van der Meer, L., Ramdoss, S., O’Reilly, M.F., Lancioni, G.E., Davis, T.N., Rispoli, M., Lang, R., Marschik, P.B., Sutherland, D.,
Green, V.A., Sigafoos, J., 2013. Using iPods® and iPads® in teaching programs for individuals with developmental disabilities: a systematic
review. Res. Dev. Disabil. 34, 147–156.
Khowaja, K., Salim, S.S., 2013. A systematic review of strategies and computer-based intervention (CBI) for reading comprehension of children
with autism. Res. Autism Spectr. Dis. 7, 1111–1121.
Laws, G., Hall, A., 2014. Early hearing loss and language abilities in children with Down syndrome. Int. J. Lang. Commun. Disord. 49, 333–342.
Liberati, A., Altman, D.G., Tetzlaff, J., Mulrow, C., Gøtzsche, P.C., Ioannidis, J.P.A., Clarke, M., Devereaux, P.J., Kleijnen, J., Moher, D., 2009.
The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. Ann. Intern. Med. 151, W-65.
Lidström, H., Hemmingsson, H., 2014. Benefits of the use of ICT in school activities by students with motor, speech, visual, and hearing impairment:
a literature review. Scand. J. Occup. Therapy, 1–16.
Massaro, D.W., Light, J., 2004. Using visible speech to train perception and production of speech for individuals with hearing loss. J. Speech Lang.
Hear. Res. 47, 304–320.
May-Mederake, B., 2012. Early intervention and assessment of speech and language development in young children with cochlear implants. Int. J.
Pediatr. Otorhinolaryngol. 76, 939–946.
McCormack, J., McLeod, S., McAllister, L., Harrison, L.J., 2009. A systematic review of the association between childhood speech impairment and
participation across the lifespan. Int. J. Speech-Lang. Pathol. 11, 155–170.
McLeod, S., Daniel, G., Barr, J., 2013. “When he’s around his brothers . . . he’s not so quiet”: the private and public worlds of school-aged children
with speech sound disorder. J. Commun. Disord. 46, 70–83.
Mesulam, M.-M., Weintraub, S., Rogalski, E.J., Wieneke, C., Geula, C., Bigio, E.H., 2014. Asymmetry and heterogeneity of Alzheimer’s and
frontotemporal pathology in primary progressive aphasia. Brain 137, 1176–1192.
Monfort, M., Juárez-Sánchez, A., 1989. Registro Fonológico Inducido (Tarjetas Gráficas). CEPE, Madrid, Spain.
Moore, D., Rosenberg, J., Coleman, J., 2005. Discrimination training of phonemic contrasts enhances phonological processing in mainstream school
children. Brain Lang. 94, 72–85.
Report, 2014. Prevalence of Different Types of Speech, Language and Communication Disorders and Speech Pathology Services in Australia.
Parliament House, Canberra, Australia.
Rong, J., Li, G., Chen, Y.-P.P., 2009. Acoustic feature selection for automatic emotion recognition from speech. Inf. Process. Manage. 45, 315–328.
Saz, O., Yin, S.-C., Lleida, E., Rose, R., Vaquero, C., Rodríguez, W.R., 2009. Tools and technologies for computer-aided speech and language
therapy. Speech Commun. 51, 948–967.
Schipor, O.-A., Pentiuc, S.-G., Schipor, M.-D., 2010. Improving computer based speech therapy using a fuzzy expert system. Comput. Inform. 29,
303–318.
Schulz, K.F., Altman, D.G., Moher, D., 2010. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ
340, c332.
Segers, E., Verhoeven, L., 2004. Computer-supported phonological awareness intervention for kindergarten children with specific language impair-
ment. Lang. Speech Hear. Serv. Sch. 35, 229–239.
Shriberg, L.D., Austin, D., Lewis, B.A., McSweeny, J.L., Wilson, D.L., 1997. The percentage of consonants correct (PCC) metric: extensions and
reliability data. J. Speech Lang. Hear. Res. 40, 708–722.
Silva, M.P.d., Junior, A.A.C., Balen, S.A., Bevilacqua, M.C., 2012. Software use in the (re)habilitation of hearing impaired children. Jornal da
Sociedade Brasileira de Fonoaudiologia 24, 34–41.
Stacey, P.C., Raine, C.H., O’Donoghue, G.M., Tapper, L., Twomey, T., Summerfield, A.Q., 2010. Effectiveness of computer-based auditory training
for adult users of cochlear implants. Int. J. Audiol. 49, 347–356.
Thompson, C.K., Choy, J.J., Holland, A., Cole, R., 2010. Sentactics® : Computer-automated treatment of underlying forms. Aphasiology 24,
1242–1266.
Tjondronegoro, D., Chen, Y.-P.P., Joly, A., 2008. A scalable and extensible segment-event-object-based sports video retrieval system. ACM Trans.
Multimedia Comput. Commun. Appl. 4, 1–40.
Tyler, A.R., 2007. Expert Systems Research Trends. Nova Science Publishers Inc., US.
van Vuuren, S., Cherney, L., 2014. A virtual therapist for speech and language therapy. In: Bickmore, T., Marsella, S., Sidner, C. (Eds.), Intelligent
Virtual Agents. Springer International Publishing, pp. 438–448.
Ward, W., Cole, R., Bolaños, D., Buchenroth-Martin, C., Svirsky, E., Vuuren, S.V., Weston, T., Zheng, J., Becker, L., 2011. My science tutor: a
conversational multimedia virtual tutor for elementary school science. ACM Trans. Speech Lang. Process. 7, 1–29.
Wild, M., 2009. Using computer-aided instruction to support the systematic practice of phonological skills in beginning readers. J. Res. Read. 32,
413–432.
Wren, Y., Roulstone, S., 2008. A comparison between computer and tabletop delivery of phonology therapy. Int. J. Speech-Lang. Pathol. 10, 346–363.
Wren, Y., Roulstone, S., Williams, A.L., 2010. Computer-based intervention. In: Williams, A.L., McLeod, S., McCauley, R.J. (Eds.), Interventions
for Speech, Sound, Disorders in Children. Paul H. Brookes, USA.
