
2015 International Workshop on Pattern Recognition in NeuroImaging

Hidden Markov models for reading words from the human brain
Sanne Schoenmakers1∗, Tom Heskes2, Marcel van Gerven1
1 Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
2 Radboud University, Institute for Computing and Information Sciences, Nijmegen, The Netherlands

Corresponding author at: Radboud University, Donders Institute for Brain, Cognition and Behaviour,
Donders Centre for Cognition, P. O. Box 9104, 6500 HE Nijmegen, The Netherlands.
E-mail address: s.schoenmakers@donders.ru.nl. (S. Schoenmakers)

Abstract—Recent work has shown that it is possible to reconstruct perceived stimuli from human brain activity. At the same time, studies have indicated that perception and imagery share the same neural substrate. This could bring cognitive brain computer interfaces (BCIs) that are driven by direct readout of mental images within reach. A desirable feature of such BCIs is that subjects gain the ability to construct arbitrary messages. In this study, we explore whether words can be generated from neural activity patterns that reflect the perception of individual characters. To this end, we developed a graphical model where low-level properties of individual characters are represented via Gaussian mixture models and high-level properties reflecting character co-occurrences are represented via a hidden Markov model. With this work we provide the initial outline of a model that could allow the development of cognitive BCIs driven by direct decoding of internally generated messages.

Index Terms—fMRI, hidden Markov model, visual cortex, brain computer interface, brain decoding, language model

I. INTRODUCTION

Recent work has shown that it is possible to obtain accurate reconstructions of perceived stimuli from the brain for shapes [1], [2], faces [3], handwritten characters [4], natural images [5] and movies [6]. With the first steps in decoding of mental images [7], the idea of a cognitive brain computer interface that is driven by direct read-out of internally generated messages could come within reach. One way to achieve this objective is to reconstruct individual characters from neural activity patterns that together form words and, ultimately, whole sentences.

Previous work on decoding of brain signals has shown that it is important to incorporate prior knowledge in order to improve reconstruction quality [5], [6], [8]. Furthermore, purely discriminative models that do not make use of prior knowledge have been shown to yield less accurate reconstructions [4]. In this paper, we explore whether reconstruction of words (i.e. sequences of perceived handwritten characters) from patterns of brain activity can be improved by taking knowledge of character co-occurrences into account. That is, we will use prior knowledge of character pairs as they are found in the English language. In language, some letters are typically followed by some but not all letters. For instance, the letter “C” is often followed by “A” or “O”, but seldom by “X” or “G”.

In order to incorporate knowledge of letter sequences to facilitate word reconstruction from activity patterns in early visual cortex, a hidden Markov model (HMM) will be employed. HMMs have been successfully applied for resolving words in handwritten language [9], [10] as well as spoken language [11], [12]. They have also been used before to resolve dynamic changes in low-level perceptual states [13].

In order to recover words from neural activity patterns, Gaussian mixture models (GMMs) are used to learn character-specific priors that represent the shapes of handwritten characters [14]. This is combined with the use of an HMM in order to model character co-occurrences. The resulting prior on handwritten character sequences is combined with a likelihood term that models how perceived handwritten characters lead to changes in fMRI BOLD responses. The complete model is depicted in Figure 1 and can be used to recover the most likely sequence of handwritten characters. By using this integrated approach we expect to improve the reconstruction of words from the human brain.

Fig. 1. Graphical representation of our GMM/HMM approach. Regression coefficients B and covariance Σ are estimated from training data and parameterize a likelihood term. Gaussian mixture models use means m_i and covariances R_i, estimated from a separate set per category i, to model the probability that an image x belongs to each of the categories. The hidden Markov model uses the category estimates for the characters in the word to decode the entire word with Viterbi decoding. Unigrams and bigrams provide constraints on character (co-)occurrences. The model returns the most likely word given observed brain responses y.

II. METHODS

In the following, we will briefly summarize the GMM-based decoding approach that has been developed in previous work. Next, we generalize this approach by incorporating an HMM for modeling character co-occurrences. Finally, we describe the experimental data which was used to validate our approach.
A. GMM-based decoding

Our goal is to learn a mapping between a stimulus x = (x_1, ..., x_p) ∈ R^p and the associated measured response vector from the brain y = (y_1, ..., y_q) ∈ R^q. That is, y = Bx + ε with B ∈ R^{p×q} a matrix of regression coefficients and ε zero-mean normally distributed noise. The regression coefficients are estimated using a standard L2-regularised linear regression approach, as described in [8]. The likelihood function is then given by

P(y|x) = N(y; Bx, Σ) ,   (1)

with diagonal covariance matrix Σ.
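The paper does not spell out the estimator beyond citing [8]; as a hedged illustration, the sketch below shows one standard way the ridge-regressed coefficients B and a diagonal Σ for Equation (1) could be obtained. All names, shapes and the penalty `lam` are our own assumptions (B is stored with shape (q, p) so that y = B @ x holds in code).

```python
import numpy as np

def fit_likelihood(X, Y, lam=1.0):
    """Estimate regression coefficients B and a diagonal noise covariance Sigma
    from training stimuli X (n_samples x p) and responses Y (n_samples x q).

    Minimal sketch of an L2-regularised (ridge) regression; `lam` is a
    hypothetical regularisation strength, not a value from the paper.
    """
    n, p = X.shape
    # Ridge solution of Y ~ X B^T; B has shape (q, p) so that y = B @ x.
    B = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y).T
    # Diagonal noise covariance estimated from the training residuals.
    residuals = Y - X @ B.T
    sigma_diag = residuals.var(axis=0)
    return B, np.diag(sigma_diag)
```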
Assume that the stimuli (e.g. handwritten characters) belong to different stimulus categories (e.g. letter classes). For the prior distribution over stimuli x, we use a Gaussian mixture model, with a mixture component for each different stimulus category:

P(x) = \sum_i P(i) N(x; m_i, R_i) ,   (2)

with P(i) the prior probability of category i, and m_i and R_i the mean and covariance matrix, respectively, of its Gaussian mixture component. The means and covariances are estimated from a stimulus set whose exemplars are different from those that are used to measure neural response patterns.

By applying Bayes' rule, we can compute the MAP estimate from the prior and likelihood terms and obtain a reconstruction x given a brain response y. Following standard probabilistic inference, see e.g. [15], we obtain:

P(x|y) = \sum_i P(i|y) P(x|y, i) .   (3)

We obtain the right-hand side components by applying Bayes' rule, yielding:

P(i|y) = P(i) P(y|i) / \sum_j P(j) P(y|j) ,   (4)

P(x|y, i) = P(y|x) P(x|i) / P(y|i) ,   (5)

with

P(y|i) = \int dx P(y|x) P(x|i) .   (6)

An explicit derivation of these equations is given in [4].
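For concreteness, the sketch below evaluates Equations (4)-(6) under the linear-Gaussian likelihood of Equation (1): the marginal P(y|i) is then Gaussian with mean B m_i and covariance B R_i B^T + Σ, and the mode of P(x|y, i) follows from standard Gaussian conditioning. This is a minimal illustration with our own naming, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def decode_category_posterior(y, B, Sigma, means, covs, prior):
    """Compute P(i|y) (Eq. 4) and the per-category posterior mean of x
    (the mode of P(x|y,i), Eq. 5) for a linear-Gaussian model y = Bx + eps.

    means[i], covs[i] are the GMM parameters m_i, R_i; prior[i] is P(i).
    Illustrative sketch only.
    """
    n_cat = len(means)
    log_py_i = np.empty(n_cat)
    x_map = []
    for i in range(n_cat):
        mu_y = B @ means[i]                    # mean of P(y|i)
        cov_y = B @ covs[i] @ B.T + Sigma      # covariance of P(y|i), cf. Eq. (6)
        log_py_i[i] = multivariate_normal.logpdf(y, mean=mu_y, cov=cov_y)
        # Posterior mean of x given y and category i (Gaussian conditioning).
        gain = covs[i] @ B.T @ np.linalg.inv(cov_y)
        x_map.append(means[i] + gain @ (y - mu_y))
    # Normalise P(i|y) in log space for numerical stability (Eq. 4).
    log_post = np.log(prior) + log_py_i
    log_post -= np.logaddexp.reduce(log_post)
    return np.exp(log_post), x_map
```

The returned category posterior P(i|y) is exactly the per-character evidence that the HMM stage below consumes.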
B. HMM-based decoding

Instead of specifying P(i) as a uniform probability distribution over categories, we here implement a hidden Markov model (HMM) to define a prior over character sequences. We can form a bigram of letter sequences, where the bigram indicates the probability of a character following another character, for all possible character categories:

P(i_{n+1} | i_n) = P(i_n, i_{n+1}) / P(i_n) .   (7)

To prevent numerical issues, a small positive number is added to all frequencies before normalisation.
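A minimal sketch of how such smoothed unigram and bigram tables could be estimated from a word list; the smoothing constant `eps` and the function name are our own, since the paper only states that a small positive number is added before normalisation.

```python
import numpy as np

def estimate_ngrams(words, alphabet, eps=1e-3):
    """Estimate a unigram table P(i) and a bigram table P(i_{n+1}|i_n)
    from a word list, cf. Eq. (7).

    `eps` is a hypothetical smoothing constant ("small positive number")
    added to all frequencies before normalisation.
    """
    idx = {c: k for k, c in enumerate(alphabet)}
    uni = np.full(len(alphabet), eps)
    bi = np.full((len(alphabet), len(alphabet)), eps)
    for w in words:
        for a, b in zip(w[:-1], w[1:]):
            bi[idx[a], idx[b]] += 1.0
        for c in w:
            uni[idx[c]] += 1.0
    # Rows of `bi` give P(i_{n+1} | i_n); `uni` gives P(i).
    return uni / uni.sum(), bi / bi.sum(axis=1, keepdims=True)
```

For the six-word subset used later one could call, for example, `estimate_ngrams(["barn", "rain", "bins", "bras", "sins", "bars"], "abinrs")`; the full-vocabulary case described in Section II-C would use the English word list instead.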
Next, the goal is to find the most likely sequence of hidden states (i.e. the character sequence w = (i_1, ..., i_N)) that results in a sequence of observed events (i.e. measured neural responses). This sequence is called the Viterbi path and it can be computed using the Viterbi algorithm [16] (see e.g. [15]). The Viterbi algorithm is a recursive algorithm which proceeds as follows. Initialise the iteration with

V_1(i_1) = log P(i_1) + log P(y|i_1) ,   (8)

where P(i_1) is a prior on unigrams (individual characters) and P(y|i_1) is the probability of the observed neural response given the first character in the word (cf. Equation (6)). Next, for n = 1, ..., N − 1, compute

V_{n+1}(i_{n+1}) = \max_{i_n} C_{n+1}(i_n, i_{n+1}) ,   (9)

where

C_{n+1}(i_n, i_{n+1}) = V_n(i_n) + log P(i_{n+1}|i_n) + log P(y|i_{n+1}) .   (10)

After completing the recursion we can compute the most probable state of the final character in the sequence:

i*_N = \arg\max_{i_N} V_N(i_N)   (11)

and backtrack in order to compute the most probable states of the characters at positions N − 1, ..., 1:

i*_n = \arg\max_{i_n} C_{n+1}(i_n, i*_{n+1}) .   (12)
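The recursion of Equations (8)-(12) translates directly into a short log-space dynamic program. The sketch below assumes `log_emission[n, i]` holds log P(y|i) for the response observed at position n (obtained from the GMM stage); the array names are our own.

```python
import numpy as np

def viterbi(log_unigram, log_bigram, log_emission):
    """Most likely character sequence under Eqs. (8)-(12).

    log_unigram:  (K,)    log P(i) for the first position
    log_bigram:   (K, K)  log P(i_{n+1} | i_n)
    log_emission: (N, K)  log P(y | i) for the response at each position
    Returns the indices of the decoded characters. Sketch only.
    """
    N, K = log_emission.shape
    V = np.empty((N, K))
    back = np.zeros((N, K), dtype=int)
    V[0] = log_unigram + log_emission[0]                      # Eq. (8)
    for n in range(N - 1):
        # C[i_n, i_{n+1}] = V_n(i_n) + log P(i_{n+1}|i_n) + log P(y|i_{n+1})
        C = V[n][:, None] + log_bigram + log_emission[n + 1]  # Eq. (10)
        V[n + 1] = C.max(axis=0)                              # Eq. (9)
        back[n + 1] = C.argmax(axis=0)
    path = [int(V[-1].argmax())]                              # Eq. (11)
    for n in range(N - 1, 0, -1):                             # Eq. (12), backtracking
        path.append(int(back[n][path[-1]]))
    return path[::-1]
```

For the position-specific (time-inhomogeneous) bigrams discussed later, `log_bigram` would simply be indexed by position as well.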
C. Experimental validation

We tested the model on a previously acquired and preprocessed fMRI dataset [8] to investigate the performance of the HMM for word decoding. Three participants viewed 360 instances of handwritten characters out of six letter categories (B, R, A, I, N, S). The six words “barn”, “rain”, “bins”, “bras”, “sins” and “bars” were chosen and a posteriori formed by selecting the corresponding brain responses from the test set. The test set consisted of 72 characters; twelve unique instances for each of the six characters. A total of 20736 permutations of each word were formed by using all possible combinations of the twelve instances of the characters.

We investigated two cases of word decoding. Firstly, words were decoded from the subset of six four-letter words, with accompanying unigrams and bigrams learned on this subset of words. In this case the categories of the characters were also limited to the six character categories that are represented in the dataset. Secondly, we evaluated word decoding with unigrams and bigrams based on the complete English language. Words were taken from a list of approximately 110,000 English words [17] to learn the bigrams and unigrams. In this case the character categories were expanded to the complete English alphabet during decoding.
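As a side note, the 20736 permutations per word are simply the 12^4 ways of picking one of the twelve test instances for each of the four letter positions; a small hypothetical helper (our own naming) makes this explicit.

```python
from itertools import product

def word_permutations(word, instances):
    """All combinations of single-character test instances that spell `word`.

    `instances[c]` would hold the twelve test-set responses for character c.
    For four-letter words with twelve instances per character this yields
    12**4 = 20736 permutations, as reported in the paper.
    """
    return list(product(*(instances[c] for c in word)))
```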

Fig. 2. Bar graphs for number of correct classifications over permutations of six four-letter words for all three subjects when a subset of words is used
to construct the prior. (A) The percentage of correctly classified letters. (B) The percentage of correctly classified words. The first set of bars in each graph
shows the performance for the baseline model. The second set of bars shows the performance for the time-homogeneous HMM. The third set of bars shows
the performance for the time-inhomogeneous HMM. Error bars indicate the standard error of the mean.

Three types of models were compared. The baseline model contains the estimation of the word calculated with only Gaussian mixture models, i.e. with a uniform prior P(i_{n+1}|i_n) = P(i_{n+1}) = 1/6. Secondly, we consider GMMs combined with the HMM. Finally, for the combined approach we vary how the bigram is formed. That is, we used either a stationary bigram that was independent of character position, or a position-specific bigram, resulting in different bigrams for each letter position. These approaches are referred to as time-homogeneous and time-inhomogeneous, respectively [18].
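To make the time-inhomogeneous variant concrete, a position-specific bigram table can be estimated by counting transitions separately for each letter position; the sketch below mirrors the n-gram helper above and uses our own naming, not the authors' code.

```python
import numpy as np

def estimate_positional_bigrams(words, alphabet, word_len=4, eps=1e-3):
    """Position-specific bigram tables P_n(i_{n+1} | i_n), one per transition
    (time-inhomogeneous case), in contrast to the single shared table of the
    time-homogeneous case. Sketch only; `eps` is a hypothetical smoothing term.
    """
    idx = {c: k for k, c in enumerate(alphabet)}
    K = len(alphabet)
    bi = np.full((word_len - 1, K, K), eps)
    for w in words:
        if len(w) != word_len:
            continue
        for n in range(word_len - 1):
            bi[n, idx[w[n]], idx[w[n + 1]]] += 1.0
    return bi / bi.sum(axis=2, keepdims=True)
```

In the Viterbi recursion, the transition term is then looked up per position n instead of from one shared table.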
In order to quantify decoding performance, the percentage of correctly classified characters and words was calculated for all subjects and for all words. Paired-sample t-tests were applied to assess the significance of our findings. Comparisons are made between the baseline model and the HMMs that used either the subset of words or the full English corpus.

III. RESULTS

Figure 2 shows character decoding performance for the baseline model compared to that of the HMM in the time-homogeneous and time-inhomogeneous settings when using the subset of words. In the time-homogeneous setting, characters are decoded correctly significantly more often than with the baseline model (p < 10^{-10}), yielding improvements of 22%, 30% and 24% for S01, S02 and S03 respectively. The time-inhomogeneous model showed significant improvements compared to the time-homogeneous model (p < 0.001), yielding improvements of 5%, 5% and 2% for S01, S02 and S03 respectively.

In case of whole-word decoding, results were as follows. Whole words were decoded correctly 32%, 31% and 46% more often by the time-homogeneous model compared to the baseline model for S01, S02 and S03 respectively (p < 10^{-8}). Decoding performance improved by 5%, 12% and 8% for the time-inhomogeneous model compared to the time-homogeneous model for S01, S02 and S03 respectively (p < 10^{-8}). At chance level the letters should be decoded correctly 17% of the time for single characters and 8 · 10^{-4}% of the time for whole words. Thus all models, including the baseline model, significantly surpassed chance-level performance.

Figure 3 depicts the results when the full English vocabulary is used during model construction. Compared to the baseline model, characters are correctly decoded 7% and 5% more often for S01 and S02, but performance decreases for S03 by 2% (p < 0.05). A weak increase is visible for the time-inhomogeneous model compared to the time-homogeneous model (p < 0.05). For the full English vocabulary the words are correctly identified 2%, 1% and 5% more often for S01, S02 and S03 respectively (p < 0.05), but no significant difference is found for the time-inhomogeneous HMM in contrast to the time-homogeneous HMM (p < 0.77). For the full English vocabulary the chance-level performance is 4% for single characters and 3 · 10^{-6}% for words. Hence, all models show classifications that significantly exceed chance-level performance.

IV. CONCLUSION

We introduced a graphical model, consisting of a hidden Markov model in combination with a Gaussian mixture model, to decode letter sequences from brain activity patterns. Our simulations show that, when our prior better matches the actual sequences, this indeed leads to an improvement of decoding performance.

Word decoding has been done before on percepts of entire four-letter words [19]. This work has shown that words could be distinguished at high accuracy when deciding between two possible choices.


Fig. 3. Bar graphs for number of correct classifications over permutations of six four-letter words for all three subjects when the full English corpus is applied.
(A) The percentage of correctly classified letters. (B) The percentage of correctly classified words. The first set of bars in each graph shows the performance
for the baseline model. The second set of bars shows the performance for the time-homogeneous HMM. The third set of bars shows the performance for the
time-inhomogeneous HMM. Error bars indicate the standard error of the mean.

Here, we take a different approach by showing that individual messages can be formed using HMM-based decoding of individual characters, leading to a decision between six candidate words.

Concluding, our results indicate that the HMM/GMM approach described in this paper can be a useful building block in the development of a cognitive BCI that relies on decoding of internally generated messages, providing an alternative to e.g. attention-based spellers in fMRI [20]. Ultimately, the viability of this approach depends on the ability to decode imagined rather than perceived characters, which we leave as a topic for future research.

REFERENCES

[1] B. Thirion, E. Duchesnay, E. Hubbard, J. Dubois, J.-B. Poline, D. Lebihan, and S. Dehaene, “Inverse retinotopy: inferring the visual content of images from brain activation patterns,” NeuroImage, vol. 33, no. 4, pp. 1104–1116, 2006.
[2] Y. Miyawaki, H. Uchida, O. Yamashita, M. Sato, Y. Morito, H. C. Tanabe, N. Sadato, and Y. Kamitani, “Visual image reconstruction from human brain activity using a combination of multiscale local image decoders,” Neuron, vol. 60, no. 5, pp. 915–929, 2008.
[3] A. Cowen, M. M. Chun, and B. A. Kuhl, “Neural portraits of perception: Reconstructing face images from evoked brain activity,” NeuroImage, vol. 94, pp. 12–22, 2014.
[4] S. Schoenmakers, U. Güçlü, M. A. J. van Gerven, and T. Heskes, “Gaussian mixture models and semantic gating improve reconstructions from human brain activity,” Frontiers in Computational Neuroscience, vol. 8, no. 173, pp. 1–10, 2014.
[5] T. Naselaris, R. J. Prenger, K. N. Kay, M. Oliver, and J. L. Gallant, “Bayesian reconstruction of natural images from human brain activity,” Neuron, vol. 63, no. 6, pp. 902–915, 2009.
[6] S. Nishimoto, A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu, and J. L. Gallant, “Reconstructing visual experiences from brain activity evoked by natural movies,” Current Biology, pp. 1–6, 2011.
[7] T. Naselaris, C. A. Olman, D. E. Stansbury, K. Ugurbil, and J. L. Gallant, “A voxel-wise encoding model for early visual areas decodes mental images of remembered scenes,” NeuroImage, vol. 105, pp. 215–228, 2015.
[8] S. Schoenmakers, M. Barth, T. Heskes, and M. A. J. van Gerven, “Linear reconstruction of perceived images from human brain activity,” NeuroImage, vol. 83, pp. 951–961, 2013.
[9] A. Kundu, Y. He, and P. Bahl, “Recognition of handwritten word: first and second order hidden Markov model based approach,” in Proceedings CVPR ’88, Computer Society Conference on Computer Vision and Pattern Recognition, pp. 457–462, Jun 1988.
[10] R. Bozinovic and S. Srihari, “Off-line cursive script word recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, pp. 68–83, Jan 1989.
[11] A. Ljolje and S. Levinson, “Development of an acoustic-phonetic hidden Markov model for continuous speech recognition,” IEEE Transactions on Signal Processing, vol. 39, pp. 29–39, Jan 1991.
[12] J. Wilpon, L. Rabiner, C.-H. Lee, and E. Goldman, “Automatic recognition of keywords in unconstrained speech using hidden Markov models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, pp. 1870–1878, Nov 1990.
[13] M. A. J. van Gerven, P. Kok, F. P. de Lange, and T. Heskes, “Dynamic decoding of ongoing perception,” NeuroImage, vol. 57, no. 3, pp. 950–957, 2011.
[14] S. Schoenmakers, M. A. J. van Gerven, and T. Heskes, “Gaussian mixture models improve fMRI-based image reconstruction,” in International Workshop on Pattern Recognition in Neuroimaging 2014, pp. 1–4, IEEE, 2014.
[15] C. Bishop, Pattern Recognition and Machine Learning. Springer Verlag, 2006.
[16] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
[17] SIL International Linguistics Department, Dallas, “List of English words,” 1991.
[18] D. R. Cox and H. D. Miller, The Theory of Stochastic Processes, vol. 134. CRC Press, 1977.
[19] A. Gramfort, G. Varoquaux, B. Thirion, and C. Pallier, “Decoding visual percepts induced by word reading with fMRI,” in International Workshop on Pattern Recognition in NeuroImaging (PRNI) 2012, pp. 13–16, July 2012.
[20] B. Sorger, J. Reithler, B. Dahmen, and R. Goebel, “A real-time fMRI-based spelling device immediately enabling robust motor-independent communication,” Current Biology, vol. 22, no. 14, pp. 1333–1338, 2012.

