Professional Documents
Culture Documents
doi: 10.1093/jamia/ocaa174
Review
Review
Department of Theoretical and Applied Linguistics, University of Cambridge, Language Technology Lab, Cambridge, UK
Corresponding author: Ulla Petti, MPhil, Department of Theoretical and Applied Linguistics, University of Cambridge, Lan-
guage Technology Lab, 9 West Road, Cambridge CB3 9DP, UK (ump20@cam.ac.uk)
Received 4 February 2020; Revised 14 May 2020; Editorial Decision 6 July 2020; Accepted 14 July 2020
ABSTRACT
Objective: In recent years numerous studies have achieved promising results in Alzheimer’s Disease (AD) de-
tection using automatic language processing. We systematically review these articles to understand the effec-
tiveness of this approach, identify any issues and report the main findings that can guide further research.
Materials and Methods: We searched PubMed, Ovid, and Web of Science for articles published in English be-
tween 2013 and 2019. We performed a systematic literature review to answer 5 key questions: (1) What were
the characteristics of participant groups? (2) What language data were collected? (3) What features of speech
and language were the most informative? (4) What methods were used to classify between groups? (5) What
classification performance was achieved?
Results and Discussion: We identified 33 eligible studies and 5 main findings: participants’ demographic varia-
bles (especially age ) were often unbalanced between AD and control group; spontaneous speech data were col-
lected most often; informative language features were related to word retrieval and semantic, syntactic, and
acoustic impairment; neural nets, support vector machines, and decision trees performed well in AD detection,
and support vector machines and decision trees performed well in decline detection; and average classification
accuracy was 89% in AD and 82% in mild cognitive impairment detection versus healthy control groups.
Conclusion: The systematic literature review supported the argument that language and speech could success-
fully be used to detect dementia automatically. Future studies should aim for larger and more balanced data-
sets, combine data collection methods and the type of information analyzed, focus on the early stages of the
disease, and report performance using standardized metrics.
Key words: Alzheimer’s disease, dementia, natural language processing, speech, language
C The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association.
V
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 1
2 Journal of the American Medical Informatics Association, 2020, Vol. 00, No. 0
Study examples tween healthy and AD group. Standard accuracy of over 81% was
In this section we briefly describe 2 studies to provide the reader achieved.
with a better understanding of what was examined. These 2 studies Clark and colleagues35 included both SVF and PVF tasks from
are chosen to cover different condition groups, data collection, and 107 MCI patients and 51 healthy control group participants. The tests
analysis methods. were transcribed, and language features, such as the raw count of
Fraser and colleagues8 used the recordings of 264 participants words, intrusions, repetitions, clusters, switches, mean word fre-
describing the Cookie Theft picture available on DementiaBank cor- quency, mean number of syllables, algebraic connectivity, and many
pus. Cookie Theft picture is a commonly used test in language and more were captured. The study paired linguistic measures with the in-
cognitive disorder assessment because it features a complex scene formation from magnetic resonance imaging (MRI) scans which
and describing it triggers diverse language. DementiaBank is a cor- allowed creating novel scores. The study concludes that the classifiers
pus available for research purposes that gathers speech and language trained on novel scores outperformed those trained on raw scores.
data from people with AD and other forms of dementia. The 2 par-
ticipant groups in Fraser’s study were AD and healthy control
Research questions
group. A total of 370 language and speech features related to part-
The research questions were grouped into 5 categories.
of-speech, syntactic complexity, grammatical constituents, psycho-
linguistics, vocabulary richness, information content, and repetitive-
ness and acoustics categories were extracted. The dataset was What were the characteristics of control and impaired groups?
divided into test and training data, and machine learning techniques In 33 studies, 32 different datasets were used. While some studies in-
were applied to explore the accuracy of automatic classification be- cluded up to 3 different datasets for different experiments, a few
4
1 Ammer and AD n ¼ 242 n ¼ 242 AD SS Repetition, word errors, – SVM, NN, Precision ¼
Ayed 201830 MLU morphemes, DT 79%
POS
2 Beltrami et al aMCI n ¼ 16, n ¼ 48 MCI SS Acoustic, lexical, syntac- – – –
201831 mdMCI n ¼ 16, tic
early dementia n ¼ 16
3 Boye et al AD n ¼ 5 n¼5 AD SS Lexical and semantic – – –
201432 deficit, reduced con-
versation
4 Chien et al 1) AD n ¼ 15 1) n ¼ 15 AD SS Speech length, non-si- 150 samples from RNN AUC ¼
201833 2) AD n ¼ 30 2) n ¼ 30 lence tokens 60 participants 0.956
5 Clark et al MCI n ¼ 23, n ¼ 25 AD, MCI SVF, PVF Semantic similarity of – – –
201434 AD n ¼ 10 words
6 Clark et al MCI-con n ¼ 24, MCI- n ¼ 51 MCI SVF, PVF Coherence, lexical fre- 158 Random Acc ¼ 81–
201635 non n ¼ 83 quency, graph theo- forest, 84%
retical measures SVM,
NB, MLP
7 Fang et al MCI-con n ¼ 1 n¼2 MCI SS Unique and specific – – –
201729 words, grammatical
complexity
8 Fraser et al AD n ¼ 167 n ¼ 97 AD SS Semantic, acoustic, syn- 473 samples from, Logistic re- Acc ¼ 81%
20168 tactic, information 264 partici- gression
pants
9 Garrard et al Probable AD n ¼ 5 n¼0 AD SS – – – –
201728
10 Gosztolya et al MCI ¼ 48 n ¼ 36 MCI SS Filled pauses 84 SVM Acc ¼
201636 88.1%
11 Gosztolya et al MCI n ¼ 25, early AD n n ¼ 25 AD, MCI SS Semantic, morphologi- 75 SVM Acc ¼ 86%
201937 ¼ 25 cal, acoustic attrib-
utes
12 Guinn et al AD n ¼ 28 n ¼ 28 AD SS Ratio, POS, lexical, 56 NB, DT precision ¼
201438 pauses, fillers 80%
13 Hernandez- AD n ¼ 169 n ¼ 74 AD, MCI SS Information coverage, 236 training, 26 SVM, Ran- Acc ¼ 87–
Dominguez MCI n ¼ 19 auxiliary verbs, hapax testing dom For- 94%
et al 201839 legomena est
14 Khodabakhsh AD n ¼ 27 n ¼ 27 AD SS Log voicing ratio, aver- 54 SVM, DT Acc ¼ 88–
et al. age absolute delta for- 94%
2014a40 mant and pitch
15 Khodabakhsh AD n ¼ 20 n ¼ 20 AD SS Fillers, unintelligibility, 40 SVM, DT Acc ¼ 90%
et al. no of words, confu-
2014b41 sion, pause & no an-
swer rate
Journal of the American Medical Informatics Association, 2020, Vol. 00, No. 0
(continued)
16 Khodabakhsh AD n¼28 n¼51 AD SS Ratio, POS, speech rate 79 SVM, NN Acc ¼ 84%
et al 201542 features NB,
CTree
17 Konig et al MCI n¼23, AD n¼26 n¼15 AD, MCI SS, SVF, OT Speech continuity, ratio 64 SVM EER ¼ 13–
201543 21%
18 Konig et al AD n¼27, mixed de- n¼0 AD, MCI SS, SVF, Location of first word, 165 SVM Acc ¼ 86%
201827 mentia n¼38, MCI PVF, OT words’ distribution in
n¼44, SCI n¼56 time
19 Lopez-de-Ipina AD n ¼ 20 n ¼ 20 AD SS Impoverished vocabu- 40 MLP Acc ¼ 75–
et al lary, limited replies 94.6%
2013a44
20 Lopez-de-Ipina Early AD n ¼ 1 n¼5 AD SS Fluency, acoustic 10 SVM, MLP, Acc ¼
et al Intermediate AD n ¼ 2 kNN, 93.79%
2013b45 Advanced AD n ¼ 2 DT, NB
21 Lopez-de-Ipina AD n ¼ 20 n ¼ 20 AD SS Duration, time, fre- 40 MLP, KNN Acc ¼ 95%
et al 201546 quency
22 Lopez-de-Ipina 1) AD n ¼ 6, 1) n ¼ 12, AD, MCI SS, SVF Voicing, pauses, F0, har- 18, 40, 100 MLP, CNN Acc ¼ 73–
et al 201847 2) AD n ¼ 20, 2) n ¼ 20 monicity 95%
3) MCI n ¼ 38 3) n ¼ 62
23 Luz 20187 Nr of recordings n ¼ 184 AD SS Vocalisation, speech 398 NB Acc ¼ 68%
reported record- rate, number of utter-
AD n ¼ 214 recordings ings ances across discourse
event
Journal of the American Medical Informatics Association, 2020, Vol. 00, No. 0
24 Martinez de 1) AD n ¼ 6, 1) n ¼ 12, AD, MCI SS Voicing, pauses, F0, har- 18, 40, 100 kNN, SVM, Acc ¼ 80–
Lizarduy 2) AD n ¼ 20, 2) n ¼ 20 monicity MLP, 95%
et al 201748 3) MCI n ¼ 38 3) n ¼ 62 CNN
25 Martinez-San- Possible AD n ¼ 45 n ¼ 82 AD OT Syllable intervals and – – AUC ¼ 0.87
chez et al their variation
201649
26 Mirzaei et al Early AD n ¼ 16, MCI n n ¼ 16 AD, MCI OT HNR, voice length, 48 kNN, SVM, –
201850 ¼ 16 silences DT
27 Rentoumi et al AD n ¼ 30 n ¼ 30 AD OT – – NB, SVM Acc ¼ 89%
201751
28 Sadeghian et al AD n ¼ 26 n ¼ 46 AD SS Long pauses, pause and 65 training, 7 test- MLP Acc ¼
201752 speech duration ing 94.4%
29 Satt et al 20139 MCI n ¼ 43, AD n ¼ 27 n ¼ 19 AD, MCI SS, OT Verbal reaction time, 89 SVM EER ¼
voiced segments 15.5–
18%
30 Toth et al MCI n ¼ 32 n ¼ 19 MCI SS Pauses, tempo 153 samples from SVM, Ran- Acc ¼
201553 51 participants dom For- 82.4%
est
(continued)
5
error rate; GCNN, gated convolutional neural networks; HNR, harmonics-to-noise ratio; kNN, k-nearest neighbor; MCI, mild cognitive impairment; MCI-con, mild cognitive impairment later converted into AD; MCI-
non, mild cognitive impairment not converted into AD; MD, mixed dementia; mdMCI, multiple domain mild cognitive impairment; MLP, multilayer perceptron; MLU, mean length of utterance; NB, Naive Bayes; OT, other
Abbreviations: Acc, accuracy; AD, Alzheimer’s disease; aMCI, amnestic mild cognitive impairment; AUC, area under curve; CNN, convolutional neural networks; CTree, classification tree; DT, decision tree; EER, equal
datasets were used more than once across the studies. The condi-
tion perfor-
Acc ¼ 75%
Classifica-
mance
73.6%
tions considered in this study were AD and MCI. Although MCI did
Acc ¼
not feature in the search terms, we decided not to exclude the studies
focusing solely on MCI because while MCI patients do not meet the
–
diagnostic criteria of dementia, they can sometimes convert to AD.
dom For- The studies may therefore provide an insight into the early stages of
est, SVM
Classifica-
Logistic re-
tion algo-
gression
NB, Ran-
rithm
GCNN patients who develop AD and of those who do not. To address the
heterogeneity this approach creates, the studies focusing on MCI are
267 partici-
64% of all studies reported participants’ gender and age. The av-
erage number of male participants was 35, and female participants
pants
was 50. The number of male and female participants was stated to
84
tasks; POS, part-of-speech; SCI, subjective cognitive impairment; SS, spontaneous speech; SVF, semantic verbal fluency; SVM, support vector machine.
Pausation, tempo and
speech 2010
tion level was considered in 45% of the studies. The control group
duration
SS
SS
SS
AD
n ¼ 98
n ¼ 38
AD n ¼ 169
AD n ¼ 48
Toth et al
201854
201855
201656
31
32
33
Abbreviations: AD, Alzheimer’s disease; aMCI, amnestic mild cognitive impairment; MCI, mild cognitive impairment; MCI-con, mild cognitive impairment
later converted into AD; MCI-non, mild cognitive impairment not converted into AD; MD, mixed dementia; mdMCI, multiple domain mild cognitive impair-
ment; SD, standard deviation.
What language and speech features were the most informative? study or research group with significant results,58 each feature that
The 33 studies included experiments from 21 individual research has been reported the most informative by at least 1 research group
groups. Out of the individual research groups, 18 included SS tasks is reported on equal basis. See Figure 3 for the most informative lan-
and 5 VF and OT tasks. The most informative language and speech guage features from SS, VF, and OT tasks.
features are looked at in 2 categories: those characteristic to AD,
and those to MCI.
The number of the language and speech features used in the anal-
yses ranged from 4 to 920. As the studies with a large number of fea- What methods were used to classify healthy people and the people
tures did not report all the features considered, it was difficult to with dementia?
examine what features were studied the most extensively. To avoid Out of 33 studies, 27 used ML to distinguish between healthy people
the synthesis bias towards the features that have been studied and the people with different medical conditions. Different ML
more57 and the multiple publication bias of over-representing 1 algorithms were used across studies: NNs were used in 17 studies,
8 Journal of the American Medical Informatics Association, 2020, Vol. 00, No. 0
AD MCI
inflected verbs width of parsing tree
reported speech ratio depth:width of parsing tree
Speech or utterance
interpolated clauses go-ahead utterances
duration or length syntactic complexity
question rate past
depth of parsing tree past particles
Ch
Chaos
pronoun:noun
conjunctions
pronoun frequency
prepositions
adverbs
open:closed class words ratio
word
auxiliaries
verb frequency / rate
number of nouns / noun rate information coverage POS retrieval
determiners
adjective rate
adverbs
semantic attributes
semantic processing
AD MCI
algebraic connectivity
radius graph theory
transitivity
F0 SD
Military
harmonicity acoustic impairment
voicing activation
Revolution
VF
Tradition
coherence
semantic similarity semantic processing
between words
AD MCI
Chaos
errors per token errors
Figure 3. Most informative language and speech features across SS, VF, and OT tasks (AD, Alzheimer’s Disease; MCI, mild cognitive impairment; OT, other tasks;
POS, Part-of-Speech; SD, Standard Deviation; SS, spontaneous speech; VF, verbal fluency).
Journal of the American Medical Informatics Association, 2020, Vol. 00, No. 0 9
SVMs in 16, DTs in 11, Naı̈ve Bayes in 7, and logistic regression in considered whether the participants were right- or left-handed. Simi-
2 studies. See Table 3 for details and definitions. larly, the majority of participants spoke European languages, lead-
ing to very few non-European languages being considered.
Two popular and well-performing language tasks were SS and
What classification performance has been achieved?
VF. Promising results were achieved using language features relating
The studies reviewed in this paper tend to use different measures to
to word retrieval, semantic and acoustic impairment, and error rate.
report classification performance (accuracy, precision, Area Under
Various ML algorithms were used to classify between different
Curve Receiver Operating Characteristics (AUC – ROC), making
condition groups. The best performing models were NNs, SVMs,
the comparison of performance difficult. Standard accuracy refers to
and DTs.
the level of agreement between the reference value and the test re-
Table 3. Details of ML methods used and the performance achieved. “Average of all reported outcomes” refers to the average of all measures reported across studies using the ML algorithm
and performance measure. “Average of best reported outcomes” takes the average measure of the best performance reported in each study (1 measure per study) using the ML algorithm and
performance measure
performance measure acc (n) AUC (n) precision (n) EER (n) acc (n) AUC (n) precision (n) EER (n)
Neural Nets (NNs) (n ¼ 17) average of all reported outcomes 86% (17) – 0.64 (4) – 65% (4) – – –
NNs are computer systems that are average of best reported outcomes 88% (6) 0.96 (1) 0.69 (1) – 69% (2) – – –
similar in structure to biological
neural networks and mimic the
way animals learn59
Support Vector Machines (SVMs) (n average of all reported outcomes 81% (33) – 0.68 (4) 14% (2) 78% (13) – – 19% (3)
¼ 16) average of best reported outcomes 88% (9) – 0.79 (1) 14% (2) 77% (6) – – 19% (2)
SVMs are supervised models that use
training data belonging to 1 or an-
other category, and assign a cate-
gory to test data based on the
training data60
Decision Trees (DTs) (n ¼ 11) average of all reported outcomes 82% (24) – 0.63 (5) – 78% (9) – – –
DTs take several input variables and, average of best reported outcomes 90% (4) – 0.96 (2) – 80% (3) – – –
based on the observations about
them, predict the value of a target
variable61
Naı̈ve Bayes (NB) (n ¼ 7) average of all reported outcomes 81% (8) – 0.81 (1) – 67% (1) – – –
NB classifiers are simple probabilis- average of best reported outcomes 81% (4) – 0.81 (1) – 67% (1) – – –
tic classifiers where each variable
contributes independently to the
assigning of class label62
Abbreviations: acc, accuracy; AD, Alzheimer’s disease; AUC, area under curve; CD, cognitive decline; EER, equal error rate.
Journal of the American Medical Informatics Association, 2020, Vol. 00, No. 0
1 Ammer and Ayed 201830 feature selection: kNN; classifier: SVM precision ¼ 79%
2 Beltrami et al 201831 Acoustic features –
3 Boye et al 201432 – –
4 Chien et al 201833 bidirectional LSTM RNN AUC ¼ 0.956
5 Clark et al 201434 Semantic similarity features –
6 Clark et al 201635 Classifiers with novel scores including MRI data Acc ¼ 81–84%
7 Fang et al 201729 length of sentence, unique words, non-specific, and specific –
Abbreviations: Acc, accuracy; AD, Alzheimer’s disease; CNN, convolutional neural networks; DT, decision trees; ET, emotional temperature; kNN, k-nearest
neighbor; LSTM RNN, long short-term memory recurrent neural network; MLP, multilayer perception; MMSE, mini-mental state examination; MRI, magnetic
resonance imaging; RFC, random forest classifier; SVM, support vector machine.
communication (gestures), 16) include syllable-timed and low- There are 5 main limitations, 4 of which contributed to the
resource languages, 17) replicate the results of currently available decision to rate down the outcome confidence level from high to
studies, 18) evaluate the temporal change and the severity of the dis- moderate.
ease, and 19) include more forms of dementia, such as vascular de- First, the chance of publication bias must be acknowledged,
mentia. meaning that only the studies with more significant results might
have been published.65,66 Although publication bias was undetected
Study limitations in the current review, it is especially common in literature reviews
To evaluate the limitations and establish the confidence level of the written in the early stages of the specific research area due to
outcomes, we adapt GRADE guidelines.64 negative studies being delayed66 and should therefore be mentioned.
12 Journal of the American Medical Informatics Association, 2020, Vol. 00, No. 0
Potential publication bias was not used to decrease the confidence AUTHOR CONTRIBUTIONS
level.
SB and AK contributed to the conception of the manuscript. UP per-
Second, there is a potential synthesis bias in the study location,
formed article collection and examination, data summarization and
as only articles written in English were included.57,58 This did not al-
analysis, and drafted the manuscript. SB contributed significantly to
low for the data available in other languages to be considered, limit-
article screening and data analysis and revised and edited the manu-
ing our dataset and possibly contributing to the small number of
script. AK provided research direction, commented on the manu-
non-European languages being included. Language bias can espe-
script, and approved the final version of the manuscript.
cially affect the outcomes relating to the most informative language
features, as these are directly dependent on the language used.
18. Forbes KE, Venneri A, Shanks MF. Distinct patterns of spontaneous in managed-care facilities. In: 2014 IEEE Symposium on Computational
speech deterioration: an early predictor of Alzheimer’s disease. Brain Intelligence in Healthcare and e-Health, Proceedings; December 9–12,
Cogn 2002; 48 (2–3): 356–61. 2014; Orlando, FL.
19. Rodrıguez-Aranda C, Waterloo K, Johnsen SH, et al. Neuroanatomical 39. Hern andez-Domınguez L, Ratte S, Sierra-Martınez G, et al. Computer-
correlates of verbal fluency in early Alzheimer’s disease and normal aging. based evaluation of Alzheimer’s disease and mild cognitive impairment
Brain Lang 2016; 155-156: 24–35. patients during a picture description task. Alzheimer’s Dement 2018; 10:
20. Venneri A, Jahn-Carta C, De Marco M, et al. Diagnostic and prognostic 260–8.
role of semantic processing in preclinical Alzheimer’s disease. Biomark 40. Khodabakhsh A, Kuscuoglu S, Demiroglu C. Detection of Alzheimer’s dis-
Med 2018; 12 (6): 637–51. ease using prosodic cues in conversational speech. In: 22nd Signal Process-
21. Garrard P, Carroll E. Lost in semantic space: a multi-modal, non-verbal ing and Communications Applications Conference; April 23–25, 2014;
57. Booth A, Papaioannou D, Sutton A. Systematic Approaches to a Success- 63. ISO 5725-1:1994: Accuracy (trueness and precision) of measurement meth-
ful Literature Review. London: SAGE; 2016. ods and results. Part 1: general principles and definitions. https://www.iso.
58. Song F, Parekh S, Hooper L, et al. Dissemination and publication of re- org/obp/ui/#iso:std:iso:5725:-1:ed-1:v1:en Accessed April 17, 2020.
search findings: and updated review of related biases. Health Technol As- 64. Guyatt G, Oxman AD, Akl EA, et al. GRADE guidelines: 1. Introduc-
sess 2010; 14 (8): iii–193. tion—GRADE evidence profiles and summary of findings tables. J Clin
59. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in ner- Epidemiol 2011; 64 (4): 383–94.
vous activity. Bull Math Biophys 1943; 5 (4): 115–33. 65. Rosenthal R. The file drawer problem and tolerance for null results. Psy-
60. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995; 20 (3): chol Bull 1979; 86 (3): 638–41.
273–97. 66. Guyatt G, Oxman AD, Montori V, et al. GRADE guidelines: 5.
61. Quinlan J. Induction of decision trees. Mach Learn 1986; 1 (1): 81–106. Rating the quality of evidence—publication bias. J Clin Epidemiol 2011;