Tracking statistical learning online: Word segmentation in a target detection task

Acta Psychologica 215 (2021) 103271
Krisztina Sára Lukics *, Ágnes Lukács
Department of Cognitive Science, Budapest University of Technology and Economics, Budapest, Egry József utca 1, H-1111, Hungary
MTA-BME Momentum Language Acquisition Research Group, Eötvös Loránd Research Network (ELKH), Budapest, Egry József utca 1, H-1111, Hungary
ARTICLE INFO

Keywords: Statistical learning; Word segmentation; Online target detection; Two-alternative forced choice task; Statistically-induced chunking recall task

PsycINFO classification codes: 2343 (Learning & Memory); 2220 (Tests & Testing)

ABSTRACT

Despite the essential role of statistical learning in shaping human behavior, there are still controversies concerning its measurement. In this paper, we present a novel online target-detection task in an acoustic word segmentation paradigm, which is able to track the process of learning and does not build on deliberation and decision making. Beside testing the novel online task, we also examined its relationship with two offline measures: the traditional two-alternative forced choice (2AFC) task, and the statistically-induced chunking recall (SICR) task (Isbilen et al., 2017). Participants showed a significant learning effect on the online task, reflected in the decrease of reaction times during training and in the differences between reaction times to predictable versus unpredictable targets. Online learning scores correlated with the 2AFC scores, but this association was only present when participants did not have explicit knowledge about stimuli. SICR scores were not associated with any of the other measures. The internal consistency was higher for online learning measures than for the other two tasks. These findings show that the online target detection task is a good tool for assessing statistical learning, and invite further research on its psychometric properties.
* Corresponding author at: Department of Cognitive Science, Budapest University of Technology and Economics, Budapest, Egry József utca 1, H-1111, Hungary.
E-mail addresses: lukics.krisztina.sara@ttk.bme.hu (K.S. Lukics), lukacs.agnes@ttk.bme.hu (Á. Lukács).
https://doi.org/10.1016/j.actpsy.2021.103271
Received 5 June 2020; Received in revised form 8 February 2021; Accepted 10 February 2021
Available online 22 March 2021
0001-6918/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Our world is full of ambiguities, yet we rarely pause in confusion before we interpret stimuli in the environment. For instance, when we hear an utterance like “thebabytalks”, we tend to segment it as “the baby talks” instead of, for instance, “theb abyt alks” or many other possibilities. This ability is essential for language processing as natural speech lacks pauses as reliable cues for word boundaries. One of the reasons why we are able to segment this phrase into words is that in the course of linguistic development, we hear “the” many times and in numerous contexts, while we meet the segment “theb” less frequently, only if a word with a first phoneme “b” follows the word “the”. By repeated exposure to words like “the”, “baby” and “talks”, that is, by the relatively frequent co-occurrence of the phonemes and syllables within the words as opposed to across word boundaries, we learn to extract these as units from the speech stream. This shows that we have acquired something about the structure of our environment, in this case, of language input, which is ambiguous for the naïve observer. That is, we have learnt to segment and perceive “the”, “baby” and “talks” as words instead of “theb”, “abyt” and “alks”. This process of extracting patterns and regularities from the environment based on multiple exposures to events and co-occurrence of events is called statistical learning (SL) (e.g., Frost, Armstrong, Siegelman, & Christiansen, 2015).

SL is tested in several ways in different domains, but the most widely used method for assessing SL is the word segmentation paradigm. In this task, participants are presented with a continuous stream of artificial words with no cues for word boundaries other than the statistical properties of the stimulus: transitional probabilities (TPs) are higher within words (e.g. between the first and second syllable of the same word) than across word boundaries (between the last syllable of one word and the first syllable of the following one). While we use the terms speech stream and words here as the paradigm is most widely tested in the acoustic linguistic domain, the segmentation task is used in other, visual and nonlinguistic domains, as well. Generally, participants are not informed about the presence of the underlying structure and are not instructed to learn, yet after the training phase, they show evidence of implicitly acquiring some knowledge of the regularities.

The great majority of segmentation studies apply the method of the original paradigm by Saffran and her colleagues (Saffran, Aslin, & Newport, 1996; Saffran, Newport, & Aslin, 1996) with minor modifications. This SL paradigm consists of a familiarization and a test phase. After familiarization with the speech stream, in the test phase,
participants are shown legal words from the learning phase and part-words or non-words from the same syllable set and with the same length. In judgement tasks and two-alternative forced choice (2AFC) tasks (in the case of adults) or in head-turn preference paradigms (in the case of infants), participants are able to discriminate between words and part-words or non-words above chance level.¹ This above-chance performance has been regarded as evidence that SL has occurred. However, although judgement and forced-choice tasks are the most widely used tests for SL, there have been concerns about their psychometric suitability, and therefore, their accuracy. This is especially problematic when testing individual differences in SL, which, like in other fields of cognitive science, have recently gained significant attention (particularly in the case of investigating the relationship between linguistic skills and SL, e.g., Conway, Bauernschmidt, Huang, & Pisoni, 2010; Kidd, 2012; Misyak & Christiansen, 2012; but see Kaufman et al., 2010, for the relationship of SL with intelligence and personality traits, and Siegelman & Frost, 2015, for methodological considerations).

¹ In the case of adults, there may be small differences in the test: they can be instructed to give their responses based on preference, familiarity, or the word status of the test sample.

Psychometric concerns about judgement or forced-choice measures point out that in adults, these show no, or only weak associations with other measures of SL on the same learning material (e.g., Batterink & Paller, 2017; Batterink, Reber, Neville, & Paller, 2015; Franco, Eberlen, Destrebecqz, Cleeremans, & Bertels, 2015; Isbilen, McCauley, Kidd, & Christiansen, 2017), they show low internal consistency (Arnon, 2019; Siegelman, Bogaerts, Elazar, Arciuli, & Frost, 2018), and modest stability over time (Arnon, 2019; Siegelman & Frost, 2015) (however, internal consistency and test-retest reliability are higher in the case of nonlinguistic SL tasks, Arnon, 2019; Siegelman, Bogaerts, & Frost, 2017; Siegelman, Bogaerts, Elazar, et al., 2018). Moreover, test-retest reliability is extremely weak in the case of children (Arnon, 2019). There are also theoretical considerations against the use of post-exposure explicit tasks. As deliberation and decision making processes are inherent in judgement tasks, noise deriving from variation in other cognitive abilities involved in explicit judgments might be large in these task scores, questioning their validity in measuring SL. Furthermore, metalinguistic instructions might be challenging for children, presenting a further potential confound in assessing their SL capacity.

These arising concerns and the effort to find better targeted tasks have resulted in a diverse set of paradigms providing an opportunity to find a more proper measure for SL. The target detection paradigm was adapted to verbal statistical learning by Franco et al. (2015). This task requires participants to press a button whenever they hear/see a given item in the stream. As participants proceed to extract the underlying structure of stimuli, they become increasingly better in processing the next element in linguistic sequences, resulting in quicker button presses. Franco and colleagues demonstrated learning with this measure in a post-training task: after familiarization, in a set of consecutive short streams, they compared RTs for syllables that were either predictable (second or third syllables of a word) or unpredictable (first syllables of a word). They found a meaningful diversity in these RT scores, which did not correlate with 2AFC scores. In a similar design, Batterink et al. (2015) also found no correlation between the 2AFC measures and RT measures from their target detection task.

In a subsequent study, Batterink and Paller (2017) also used a post-familiarization item-detection task in their word segmentation paradigm. They found that across all their measures, RT scores correlated with degree of neural entrainment throughout the familiarization phase, which they assumed to be the most direct indicator of learning. Although the item-detection RT measure and the explicit measures were correlated, no associations were found between entrainment changes and explicit SL measures.

In the studies by Franco et al. (2015), Batterink et al. (2015), and Batterink and Paller (2017), RTs were measured following a separate learning phase. This way, while making the familiarization phase identical to those in traditional SL tasks, they applied post-hoc testing, and measured learning at one point only. Item-detection tasks, however, also give an opportunity to track the process of learning online. Based on the Serial Reaction Time Task (Nissen & Bullemer, 1987), Hunt and Aslin (2001) created a series of segmentation tasks where light patterns on a response box followed a statistical pattern and triplets of light pairs formed “words”. Participants had to press lit buttons. As a measure of learning, reaction times of button presses to predictable versus unpredictable items were contrasted. This task revealed the acquisition of statistical patterns in a domain and modality different from the traditional word segmentation paradigms, but the findings motivate tracking SL online with other stimulus arrangements as well. In the verbal acoustic domain, Batterink (2017) examined the process of SL in a word segmentation task by comparing RTs for predictable and unpredictable syllables through numerous short streams. While the psychometric properties and the accuracy of these tasks are yet to be tested, the less explicit nature and certain findings (for example, the relationship between neural entrainment during learning and post-learning RT scores in the study of Batterink, 2017) are promising.

The statistically-induced chunking recall (SICR) task by Isbilen et al. (2017) is the result of a targeted attempt to find better measures for individual differences in SL. In this task, after familiarization, participants have to recall strings that are either structured by the statistical properties encountered during the preceding training phase (i.e. strings that consist of words), or unstructured strings which contain syllables of the familiarization phase in random order. Results showed that short-term memory performance for structured items was better than for unstructured ones. When assessing the test-retest reliability of the measure, Isbilen and her colleagues found that recall performance for structured items and random items both had excellent stability over time; the difference score, however, was less stable. They further examined the relationship between SICR performance for structured items and 2AFC scores, and found no correlation between the two measures. The authors argue that since SICR proved to be a reliable measure of SL, and 2AFC is not reliable, SICR is able to capture individual differences in SL ability.

As both target detection and SICR paradigms are good candidates for being more targeted measures of SL, which could make them suitable for more accurate group level testing as well as assessing individual differences, we need detailed studies of both measures, together with developing new tasks which might better meet these expectations. The aim of the present study is to introduce a new target detection paradigm which examines statistical learning online. We wanted to develop a new task 1) to assess learning online, getting insight into the process of learning over time, not only in a post-hoc manner; 2) to have a measure that does not build on deliberation and decision making, and thus measures SL with less influence and noise from other cognitive capacities; and 3) to obtain a task which is not metalinguistic and this way, more suitable for children. In Experiment 1, we wanted to examine the relationship between online target detection measures and scores from SICR and 2AFC tasks testing the knowledge of the same training material. By examining the relationship between online target detection scores and SICR and 2AFC scores, we wanted to test the validity of this new measure, as well as to contribute to the search for a more targeted measure of SL. We hypothesized 1) that we would find evidence of learning in online RT changes in the target detection task, 2) that the measures from the online item-detection task and the SICR task would show a significant correlation, as they are less influenced by other cognitive domains than the 2AFC task, and 3) that the relationship between the 2AFC task and the other two measures would be more modest.
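The transitional-probability cue described in the Introduction can be made concrete with a small simulation. The sketch below is illustrative only and is not the authors' stimulus-generation procedure: it builds a long stream from the four trisyllabic words reported in the Stimuli section (2.1.2), with the constraint that the same word never occurs twice in a row, and then estimates TPs from bigram counts. The function names, the stream length of 3000 words, and the random seed are assumptions made for the example; unlike the study's streams, this simplified sampler does not force every word to occur equally often.

```python
import random
from collections import Counter

# The four trisyllabic words reported in the Stimuli section (2.1.2).
WORDS = [("cé", "gá", "vi"), ("de", "ta", "mü"), ("ha", "go", "ki"), ("sá", "pe", "tu")]


def make_word_stream(words, n_words, rng):
    """Concatenate words pseudorandomly; the same word never occurs twice in a row."""
    stream, previous = [], None
    for _ in range(n_words):
        word = rng.choice([w for w in words if w != previous])
        stream.extend(word)
        previous = word
    return stream


def transitional_probabilities(stream):
    """TP(x -> y) = count(x immediately followed by y) / count(x followed by anything)."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}


# Hypothetical settings: the seed and stream length are arbitrary choices.
rng = random.Random(0)
stream = make_word_stream(WORDS, n_words=3000, rng=rng)
tps = transitional_probabilities(stream)

# Within-word transition (first -> second syllable of "cégávi").
print(tps[("cé", "gá")])
# Transitions across a word boundary (last syllable of "cégávi" -> next word).
print({pair: round(tp, 2) for pair, tp in tps.items() if pair[0] == "vi"})
```

Because each syllable belongs to exactly one word and position, every within-word TP is exactly 1, while the TPs from a word-final syllable are split across the three words that may follow, so each is close to one third. This matches the TP values of 1 and 0.33 reported for the word streams in Section 2.1.2.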
2. Experiment 1

2.1. Method

2.1.1. Participants
27 participants (mean age: 25.84, SD = 3.95, 15 females, 2 left-handed), recruited through convenience sampling, volunteered for the experiment. All were first language speakers of Hungarian. Before the experiment, all of them gave informed consent, in accordance with the principles set out in the Declaration of Helsinki and the stipulation of the local IRB. The study was approved by the United Ethical Review Committee for Research in Psychology (EPKEB). Participants were informed about the purpose of the research after the study.

2.1.2. Stimuli
Stimuli in the experiment were based on the original speech segmentation studies of Saffran and colleagues (Saffran, Aslin, & Newport, 1996; Saffran, Newport, & Aslin, 1996) with adaptations to Hungarian and to the specific aims of the task. They consisted of 12 digitally recorded nonsense consonant-vowel syllables (cé /ʦːe/, de /dε/, gá /gaː/, go /go/, ha /hɒ/, ki /ki/, mü /my/, pe /pε/, sá /ʃaː/, tu /tu/, ta /tɒ/, vi /vi/) spoken by a male Hungarian speaker. Syllables were recorded in isolation to prevent coarticulation effects, and were manipulated in Praat so that all recordings had a pitch of 130 Hz, and a length of 400 ms. Maximum intensities of recorded syllables ranged from 76.6 to 83.5 dB. Sound intensity of syllables was judged to be equal by two independent listeners. These manipulations and tests were applied in order to eliminate any potential acoustic cues to word boundaries. For the online training phase, these syllables were structured into five streams. In four of these streams, the 12 syllables were organized into four words (cégávi, detamü, hagoki, sápetu), and these words followed each other in a pseudorandom order with the constraint that the same word could not occur twice in a row (word streams). In these word streams, TPs within words were 1, while TPs at word boundaries were 0.33. In a fifth stream, the individual syllables followed a pseudorandom order in a way that two syllables could not occur twice in a row (random stream). In all the streams, syllables were presented with pauses of 100 ms in between, resulting in a presentation rate of one syllable per 500 ms. Participants had no cues for word boundaries except for the TPs. In each stream, 180 syllables were presented, each syllable with an equal frequency of occurring 15 times within each block, both in word streams and in the random stream.

For the 2AFC task, we used the four words from the training streams, and we created four part-word (syllable triplets spanning a word boundary in the word streams), and four non-word (syllable triplets not occurring together in the word streams) foils. Stimuli for the task are provided in Table 1.

Table 1
Words, part-words and non-words used in the 2AFC task.

Words     Part-words   Non-words
cégávi    kicégá       céhatu
detamü    mühago       dekisá
hagoki    tudeta       gápego
sápetu    visápe       mütavi

For the SICR task, similarly to the study of Isbilen et al. (2017), six-syllable sequences were generated. In the present study, these sequences consisted of two words, two part-word or two non-word foils, eight sequences from each type. The specific sequences are described in Table 2.

Table 2
Sequences in the SICR task by type.

Word            Part-word       Non-word
cégávidetamü    tucégákideta    tavimügátacé
sápetuhagoki    müsápevihago    pekihaságotu
detamüsápetu    videtakisápe    tumütapesáde
hagokicégávi    tuhagomücégá    vigohacékigá
cégávisápetu    kicégámüsápe    vipecétuságá
sápetudetamü    visápekideta    pemütusátade
detamühagoki    tudetavihago    gotakimühade
hagokisápetu    mühagovisápe    kituhaságope

2.1.3. Procedure
Stimuli were presented using E-Prime 2.0 Professional software (Psychology Software Tools, Inc.). During the three phases of the experiment, participants listened to the stimuli through headphones. A schematic illustration of the procedure is provided in Fig. 1.

Fig. 1. The procedure of the experiment. In the first phase, participants completed an online target detection task. After this phase, there were two more tasks: a two-alternative forced choice (2AFC) task and a statistically-induced chunking recall (SICR) task. The order of the latter two was counterbalanced across participants to eliminate potential order effects.

The first phase of the experiment was an online learning phase based on the design of the Serial Reaction Time Task (Nissen & Bullemer, 1987), aimed at tapping into the process of learning. While listening to the speech stream, participants had to respond to a target syllable. This target syllable was the last syllable of one of the four trisyllabic words, counterbalanced across participants. Last syllables were chosen as targets because previous work has shown that in target detection tasks participants show the most reliable learning effect in RT decreases to the last syllables of words in speech segmentation tasks (as opposed to middle-position syllables; Batterink et al., 2015; Franco et al., 2015). In the online task, the training phase consisted of three streams, or three training blocks (Blocks TRN1 to TRN3). Each training stream consisted of 60 words, so altogether, 180 words were presented during training; each of the four words was presented 45 times. In these three training blocks we measured how effectively participants predict the target syllable through the course of learning, reflected in the changes in reaction times and accuracies over time. We also included two test blocks in the online task. The first test block was the random stream (disrupting the word structure of previous streams with syllables following each other in a pseudorandom order, Block RND). This random stream was the baseline or reference block, as in this block the target syllable was not predictable, so RTs and accuracies were only influenced by the general practice in the task. The second test block was a word stream again, serving as a recovery block (Block REC). Throughout the five blocks, the task was the same: participants performed an item-detection task in which they were instructed to respond by pressing the space bar every time they heard the target syllable (specified in the instruction) as quickly and accurately as they could, and to press the key ‘A’ in the case of any other syllable.

After the online task, two other offline measures were collected: one from a 2AFC task, and another from a SICR task. To eliminate possible task order effects, the order of the two offline tasks was counterbalanced across participants. In the 2AFC task, participants were presented with 24 pairs of tri-syllabic strings. We tested three types of contrasts here: word vs. part-word, word vs. non-word, and part-word vs. non-word sequences. Comparing different contrast types allows testing SL with different levels of sensitivity: the word vs. part-word contrast shows how sensitive participants are to differences between stronger compared to weaker TPs (TP = 1 vs. TP = 0.33), the word vs. non-word contrast shows how sensitive participants are to differences between strong TP sequences and sequences with zero TPs (TP = 1 vs. TP = 0), and finally, the part-word vs. non-word contrast is informative about whether participants favor even weaker TP sequences over sequences with zero TPs (TP = 0.33 vs. TP = 0). The specific sequences in each contrast were randomly selected for each participant, and the order of trigram types varied within contrasts. In each trial, participants heard the two strings through headphones, and they were instructed to indicate which one was more familiar based on the training streams. We quantified learning by accuracy rates: the number of correct answers divided by the number of all trials, yielding a number between 0 and 1.

In the SICR task, in each trial, participants were presented with a six-syllable sequence which they had to immediately recall. In each trial, a sequence was presented through headphones, and participants had to recite it immediately after the presentation. The order of trials was randomized across participants. To quantify performance in the SICR
task, we calculated an accuracy rate for each sequence type (word, part-word, and non-word) by dividing the number of correctly recalled bigrams by the number of all bigrams, yielding a number between 0 and 1. This method differed from that of Isbilen et al. (2017), who quantified learning as the number of correctly recalled syllables. We chose this score as it provides the best resolution of recalled sequences while still reflecting the acquisition of item relations, which is a core element of SL.

2.2. Results

2.2.1. Online target detection task
In the online target detection task, RTs were collected from accurate button presses for targets within a 1200 ms time window from stimulus onset. Accuracies were calculated by dividing the number of correct responses on targets by the number of targets. To analyze reaction times and accuracies, we calculated the median of RTs and accuracies for each block by participant. Descriptive statistics for RTs and accuracies are shown in Table 3.

Table 3
Descriptive statistics of median RTs and accuracies by block in the online target detection task.

Block        Block type   RT median   RT range        ACC median   ACC range
Block TRN1   Training     400.00      268.00–797.00   0.60         0.33–0.87
Block TRN2   Training     342.50      105.00–658.00   0.73         0.47–1.00
Block TRN3   Training     176.00      70.00–627.00    0.80         0.40–1.00
Block RND    Random       611.00      485.00–816.00   0.53         0.20–1.00
Block REC    Recovery     236.00      35.00–631.50    0.80         0.33–1.00

RT changes through blocks 1–5 were analyzed with a Friedman test (as the assumption of normality of residuals in the parametric ANOVA was violated), and a significant effect of Block was found, χ²(4) = 70.09, p < .001. Post hoc Wilcoxon signed ranks tests revealed that RTs in each block were significantly different from the previous one (Holm-Bonferroni sequential corrections were applied because of the multiple comparisons, Gaetano, 2018): Block TRN1 > Block TRN2: Z = −3.03, p = .002, r = 0.58; Block TRN2 > Block TRN3: Z = −4.13, p < .001, r = 0.80; Block TRN3 < Block RND: Z = −4.52, p < .001, r = 0.87; Block RND > Block REC: Z = −4.52, p < .001, r = 0.87. That is, we found a learning and/or practice effect yielding decreasing RTs through the first three blocks, and a learning effect reflected in significantly longer RTs for unpredictable syllables in the random block than in the preceding and subsequent word blocks. The changes in RTs through blocks are illustrated in Fig. 2.

Fig. 2. Median reaction times by block. Boxes indicate RT data between the first and third quartiles, vertical lines indicate medians, and whiskers illustrate the range of data in each group.
[Plot: median RT (y-axis) by block (x-axis: TRN1, TRN2, TRN3, RND, REC).]

As the assumption of normal distribution was violated in the case of accuracies as well, a Friedman test was used to test differences between the blocks. This test also showed a significant block effect, χ²(4) = 42.07, p < .001. The post hoc Wilcoxon signed ranks tests after Holm-Bonferroni sequential corrections showed that there was a significant increase in hit rates between the first two blocks, and participants were significantly less accurate in their responses in the random block compared to the preceding and following word blocks, but there was no significant difference between the second and third training blocks (Block TRN1 < Block TRN2: Z = −3.17, p = .003, r = 0.61; Block TRN2 < Block TRN3: Z = −1.87, p = .062, r = 0.36; Block TRN3 > Block RND: Z = −3.46, p = .002, r = 0.67; Block RND < Block REC: Z = −3.77, p < .001, r = 0.73). Accuracies through blocks are shown in Fig. 3.

2.2.2. Two-alternative forced choice (2AFC) task
As there was no order effect either in the 2AFC task or in the SICR task, task order was not taken into account in the analysis of the post-familiarization tasks. In the 2AFC task we tested three contrasts: 1)
used recall performance for word sequences as an index of learning. As

this measure might be influenced by individual differences in baseline
** n.s. ** *** verbal short term memory capacity, difference scores between more
1.00 structured and less structured sequences are used here, which can show
SL effects without the short term memory confound. Median bigram
recall scores were 0.93 (range: 0.43–1.00) for word, 0.70 (range:
0.75 0.33–1.00) for part-word, and 0.70 (range: 0.30–0.95) for non-word
accuracy
sequences. Beside calculating contrast measures for each participant,

an overall difference measure was also calculated by averaging the three
0.50 difference measures (word > part-word, word > non-word, part-word >
non-word). Descriptive statistics for the SICR task measures are shown in
Table 5. With the exception of the part-word > non-word contrast, all
0.25 scores had a value significantly above the chance level of 0.
2.2.4. Relationships between different measures

0.00 We further examined the associations between performance measures in
the different SL tasks. For the online target-detection task, a RT training
TRN1 TRN2 TRN3 RND REC
measure was calculated by subtracting median RTs in Block TRN3 from RTs
Fig. 3. Accuracies by block. Boxes indicate accuracy data between the first and in Block TRN1. Also, a RT difference score was calculated by averaging the
third quartiles, vertical lines indicate medians, and whiskers show the range of differences between median RTs in the random block and the last training
data in each group. block, and the random block and the recovery block:
RT difference = (Block RND RT mean− Block TRN3 RT mean)+(Block RND RT mean− Block REC RT mean)
word vs. part-word, 2) word vs. non-word, and 3) part-word vs. non- 2
word sequences. For each contrast, a score was calculated by dividing

While the RT training measure reflects online learning performance,
the number of correct answers by the number of all trials. The overall
it ignores the dynamics of learning during the first three blocks: it does
measure was calculated by averaging accuracy rates in the three con
not distinguish between participants with faster and slower RT decreases
trasts. The measures analyzed in the 2AFC task are detailed in Table 4.
as long as they show the same decrease. As an attempt to capture dif
All 2AFC measures showed significant differences from the chance level
ferences in the dynamics of learning, we created a third exploratory
of 0.5, except for the 2AFC part-word > non-word, where the learning
online learning measure labeled learning efficiency. Learning efficiency
effect only approached significance.
was calculated by multiplying a learning speed score, a RND-TRN3 RT
difference score and a TRN3 RT spread score together:
2.2.3. Statistically-induced chunking recall task
For the statistically-induced chunking recall task, data from two learning efficiency = learning speed*RND − TRN3 RT difference
participants were excluded from the analysis, for one participant there *TRN3 RT spread
was an error in data recording, and another participant didn’t give
consent to record his responses. We examined the proportions of The learning speed score indicates how quickly RTs of a given
recalling bigrams (i.e. consecutive syllable pairs) from the six-syllable participant reached the level of the last training block (Block TRN3)
sequences consisting of words, part-words or non-words. Then, for within the training blocks (Blocks TRN1 to TRN3) on a trial-by-trial
each participant, the learning measure was calculated for each contrast level; the RND-TRN3 RT difference score shows the difference between
by extracting the recall performance of the less structured conditions median reaction times in Block RND and Block TRN3; while the TRN3
from performance in the more structured conditions (e.g., the measure RT spread score describes the interquartile range of reaction times in
for the word > part-word contrast was calculated as the ratio of correctly Block TRN3. The latter two scores were included because learning speed
recalled bigrams for word sequences minus the ratio of correctly recalled in itself cannot reliably indicate that learning actually occurred during
bigrams for part-word sequences).
This measure differs from the ones used by Isbilen et al. (2017) in two aspects. First, they measured the number of correctly recalled syllables and trigrams. In the present study, the proportion of correctly recalled bigrams is used, as this is the most fine-grained measure which reflects the learning of syllable co-occurrences. Second, Isbilen and colleagues
Table 4
Descriptive statistics and the differences from chance level in the case of 2AFC measures.
Type                           Median   Range        Z or t value   p (x > 0.5)   Effect size
1. 2AFC overall                0.75     0.46–0.88    Z = 4.33       p < .001      r = 0.83
2. 2AFC word > part-word       0.75     0.38–1.00    Z = 4.29       p < .001      r = 0.83
3. 2AFC word > non-word        0.75     0.38–1.00    Z = 4.23       p < .001      r = 0.81
4. 2AFC part-word > non-word   0.63     0.13–1.00    t(26) = 1.99   p = .057      r = 0.36
Note. Difference from the chance level of 0.5 was tested using a t-test in the case of measure 4, and Wilcoxon signed rank tests in the case of measures 1, 2 and 3 (as they were not normally distributed).
Table 5
Descriptive statistics and differences from chance level for individual SICR difference scores.
Type                           Median   Range         Z or t value   p (x > 0)   Effect size
1. SICR overall                0.10     −0.10–0.40    t(24) = 5.16   p < .001    r = 0.73
2. SICR word > part-word       0.10     −0.05–0.45    Z = 3.90       p < .001    r = 0.78
3. SICR word > non-word        0.15     −0.15–0.60    t(24) = 5.16   p < .001    r = 0.73
4. SICR part-word > non-word   0.08     −0.28–0.50    Z = 0.78       p = .433    r = 0.16
Note. Individual SICR scores were calculated by subtracting a subject’s ratio of correctly recalled bigrams in a less structured sequence type from that of a more structured sequence type (e.g., the SICR word > part-word contrast was calculated by subtracting the performance on part-words from the performance on words). Therefore, positive values in SICR scores mean better performance, negative scores mean worse performance in more structured sequences compared to less structured ones, while 0 means no difference. Difference from chance level of 0 was tested by t-tests in the case of measures 1 and 3, and Wilcoxon signed rank tests in the case of measures 2 and 4 (as they were not normally distributed).
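The effect sizes in Tables 4 and 5 can be checked with a short calculation. The sketch below assumes the common conversions r = Z / sqrt(N) for Wilcoxon signed rank tests and r = sqrt(t^2 / (t^2 + df)) for t-tests. The text shown here does not state which formulas were used, and the sample sizes (N = 27 for the 2AFC task, N = 25 for the SICR task) are inferred from the reported degrees of freedom rather than given explicitly. The function names are illustrative.

```python
import math


def r_from_z(z: float, n: int) -> float:
    """Effect size r for a Wilcoxon signed rank test (assumed: r = Z / sqrt(N))."""
    return abs(z) / math.sqrt(n)


def r_from_t(t: float, df: int) -> float:
    """Effect size r for a t-test (assumed: r = sqrt(t^2 / (t^2 + df)))."""
    return math.sqrt(t * t / (t * t + df))


# Test statistics copied from Tables 4 and 5; N is inferred as df + 1.
print(round(r_from_z(4.33, 27), 2))  # 2AFC overall
print(round(r_from_t(1.99, 26), 2))  # 2AFC part-word > non-word
print(round(r_from_z(3.90, 25), 2))  # SICR word > part-word
print(round(r_from_t(5.16, 24), 2))  # SICR overall
```

Under these assumptions the four calls give 0.83, 0.36, 0.78 and 0.73, which match the tabled effect sizes.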
the training blocks. For instance, a participant may show very fast learning (identical RTs at the beginning and at the end of the training) if their RTs did not decrease through the training blocks. But if their RTs at the end of the training did not differ from RTs in the reference block (Block RND), or their RTs were too scattered at the end of training, the lack of change through Blocks TRN1 to TRN3 cannot be attributed to fast learning. A detailed description of the three scores is provided in Appendix A. In the case of the 2AFC and SICR measures, the overall scores were used, as these cover the testing of all three contrasts.
We calculated Spearman-Brown reliability coefficients of split-half correlations for the indices used in the correlational analyses through 5000 iterations. For each measure, in each iteration, we divided the trials of the task into two random sets with the constraint that different trial types were represented equally in the two sets (two random sets for RTs in the different blocks in the target detection task for the online measures, and two random sets for the different contrast types in the 2AFC and SICR tasks), calculated task indices based on the responses in the two trial sets for each participant, and correlated indices calculated from these two sets. We applied Spearman-Brown correction to the obtained Pearson correlation coefficients in each iteration as the split-half method underestimates reliabilities. Mean reliability coefficients were 0.71 for RT training (2.5th percentile: 0.46; 97.5th percentile: 0.86), 0.70 for RT difference (2.5th percentile: 0.50; 97.5th percentile: 0.86), and 0.61 for learning efficiency (2.5th percentile: 0.27; 97.5th percentile: 0.83). Reliabilities of the offline measures were more moderate: 0.58 for 2AFC overall (2.5th percentile: 0.30; 97.5th percentile: 0.78), and 0.43 for SICR overall (2.5th percentile: 0.07; 97.5th percentile: 0.71).
In the analysis of relationships between RT training, RT difference, 2AFC overall, and SICR overall scores, the two RT measures showed a significant correlation, and there was also a significant correlation between the RT difference and the 2AFC overall measures. On the other hand, the SICR overall score did not correlate with any of the other task measures. Results of this correlation analysis are detailed in Table 6.
We also examined the relationship between learning efficiency and 2AFC overall and the SICR overall measures. A separate correlation analysis was conducted as learning efficiency shares variability with both RT training and RT difference. Learning efficiency was significantly correlated with the 2AFC overall score, but the SICR overall score did not correlate with the other two measures (Table 7).
For a more direct comparison of our results and those of Isbilen et al. (2017), we also included three further measures used in their original study, the 2AFC word > non-word contrast, the SICR word syllable-by-syllable scores (the proportion of recalled syllables in the case of six-syllabic sequences consisting of words), and the SICR word trigram scores (the proportion of recalled trigrams in the case of six-syllabic sequences consisting of words), in two additional analyses. There was no significant relationship either between the 2AFC word > non-word and SICR word syllable-by-syllable scores (rs = 0.02, p = .911), or between the 2AFC word > non-word and SICR trigram word scores (rs = 0.08, p = .708).
2.3. Discussion
In Experiment 1, we tested the sensitivity of a novel target detection measure of the word segmentation paradigm to SL processes. This online target detection task proved to be capable of assessing SL: the decreasing RTs and increasing accuracies through the first three blocks show that the paradigm is suitable for measuring learning in its process, providing data about its temporal properties, and the difference between the random block and the neighboring word blocks showed a learning effect that is not influenced by motor practice. We also calculated another exploratory online measure, learning efficiency, which, contrary to the other two online measures, RT training and RT difference, is capable of capturing the dynamics of learning. RT training and RT difference had relatively high split-half reliabilities. Although they did not reach the conventional psychometric standard of 0.80, these values outscore internal consistencies measured in linguistic statistical learning studies with the 2AFC task (Arnon, 2019; Siegelman, Bogaerts, Elazar, et al., 2018). Learning efficiency had more moderate reliability; however, as this was an exploratory measure, we aimed to test its adequacy further in Experiment 2.
In Experiment 1, we looked at the relationship of the novel target detection task measures and two offline measures from the 2AFC task, and the SICR task by Isbilen et al. (2017). We expected the target detection and SICR measures to show a significant correlation, as both are hypothesized to be more adequate measures of SL than the widely used 2AFC task. We also expected that the 2AFC measure, due to its deficiencies, would show weaker correlations with the online item detection and SICR task scores. Contrary to our hypotheses, the SICR measure was not associated with any other measures, and we found a significant correlation between two of our online measures (the RT difference and learning efficiency) and 2AFC scores. 2AFC scores were not particularly reliable, being in the range of earlier findings (Arnon, 2019; Siegelman, Bogaerts, Elazar, et al., 2018). The internal consistency of the SICR measure was especially weak.
A possible explanation for the relationship of online and offline 2AFC scores is that as the presentation rate was relatively slow (compared to the presentation rates from 215 to 300 ms, Franco et al., 2015; Batterink et al., 2015; Batterink & Paller, 2017), and the word set was relatively small and well separable (four words with TPs of 1 between syllables within words), the structure of the streams could have been transparent for participants. Earlier studies with visual SL tasks found that under explicit instructions and slower presentation rates (800 ms, 6000 ms, 14,000 ms) participants may gain explicit knowledge of the stimuli (as discussed by Arciuli, Torkildsen, Stevens, & Simpson, 2014). While there were no explicit instructions to look for words in the present study, the properties of the word set and the slower presentation rate could lead to the formation of explicit representations, which then would facilitate performance on the online target detection task, as well as on the 2AFC task, mediating the relationship between them. In a second experiment, we aimed to explore this possibility. As the correlations between the SICR measure and either the online target detection or 2AFC scores were not significant, it may not have been affected, or may have been affected differently, by explicit representations. Because of the lack of a significant association, we did not include the SICR task in Experiment 2.
3. Experiment 2
In Experiment 1, participants may have developed explicit
Table 6
Correlations between online and offline measures.
Measure              1        2        3
1. RT training
2. RT difference     0.68*
3. 2AFC overall      0.19     0.49*
4. SICR overall      −0.13    −0.14    0.10
Note. As the RT difference and 2AFC overall scores did not have a normal distribution, Spearman’s rs coefficients are reported here.
* p < .01.
Table 7
Correlations between learning efficiency and offline measures.
Measure                   1        2
1. Learning efficiency
2. 2AFC overall           0.59*
3. SICR overall           0.08     0.10
Note. As the 2AFC overall score did not have a normal distribution, Spearman’s rs coefficients are reported here.
* p < .01.
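The split-half procedure with Spearman-Brown correction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the data layout, the function names and the toy data are hypothetical, each participant's trials are split independently (the text does not say whether one split was shared across participants), and the RT difference index is assumed to be the median RT in Block RND minus the median RT in Block TRN3.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r):
    """Step a half-test correlation up to full test length."""
    return 2 * r / (1 + r)


def split_half_reliability(data, index_fn, iterations=5000, seed=0):
    """data: {participant: {trial_type: [values]}} (hypothetical layout).
    index_fn maps one participant's {trial_type: [values]} to a task index.
    Returns the mean corrected coefficient and its 2.5th/97.5th percentiles."""
    rng = random.Random(seed)
    coefficients = []
    for _ in range(iterations):
        first, second = [], []
        for trials_by_type in data.values():
            half_a, half_b = {}, {}
            for trial_type, values in trials_by_type.items():
                shuffled = values[:]
                rng.shuffle(shuffled)
                mid = len(shuffled) // 2
                # Every trial type is represented equally in both halves.
                half_a[trial_type] = shuffled[:mid]
                half_b[trial_type] = shuffled[mid:]
            first.append(index_fn(half_a))
            second.append(index_fn(half_b))
        coefficients.append(spearman_brown(pearson(first, second)))
    coefficients.sort()
    lower = coefficients[int(0.025 * (len(coefficients) - 1))]
    upper = coefficients[int(0.975 * (len(coefficients) - 1))]
    return statistics.fmean(coefficients), lower, upper


def rt_difference(trials):
    """Assumed index: median RT in Block RND minus median RT in Block TRN3."""
    return statistics.median(trials["RND"]) - statistics.median(trials["TRN3"])


# Synthetic toy data: 30 participants with different learning effects.
gen = random.Random(1)
toy = {}
for participant in range(30):
    effect = gen.uniform(0, 300)
    toy[participant] = {
        "TRN3": [gen.gauss(400, 40) for _ in range(30)],
        "RND": [gen.gauss(400 + effect, 40) for _ in range(30)],
    }

mean_rel, lo, hi = split_half_reliability(toy, rt_difference, iterations=200)
print(mean_rel, lo, hi)
```

The correction step matters because each half contains only half of the trials, so the raw correlation between halves underestimates the reliability of the full-length task.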
representations of words or segments predicting target syllables, and this explicit knowledge could have facilitated performance in the online target detection and the 2AFC task, driving associations between the RT difference scores and the 2AFC measures. To explore this possibility, we wanted to look at the relationship between the online reaction time (RT difference and learning efficiency) and the offline 2AFC measures in more detail, taking into account the explicit knowledge of words by participants. Thus, in Experiment 2, we tested participants with the same word segmentation task without the SICR task, as we found no associations between the scores of this task and measures from the other two tasks in Experiment 1. We also assessed participants’ explicit knowledge of the training stimuli in a debriefing session after the task.
3.1. Method
3.1.1. Participants
64 university students (mean age: 20.88, SD = 2.37, 45 females, 8 left-handed) participated in the experiment for course credit; they were all native speakers of Hungarian. Before the experiment, all of them gave informed consent, in accordance with the principles set out in the Declaration of Helsinki and the stipulation of the local IRB. The study was approved by the United Ethical Review Committee for Research in Psychology (EPKEB). Participants were informed about the purpose of the research after the study.
3.1.2. Stimuli
Stimuli were the same as in Experiment 1, without stimuli for the SICR subtask.
3.1.3. Procedure
Participants completed the experiment in groups of two or three. The procedure was the same as that of Experiment 1 without the SICR subtask. Participants completed the online learning task first, and then the offline 2AFC task. After the experiment, they filled out a debriefing form consisting of a set of open-ended questions asking about their knowledge of the stimuli used in the experiment. Questions of the debriefing session are listed in Appendix B.
3.2. Results
3.2.1. Online target detection task
To analyze reaction time and accuracy data through the online task, we calculated median RTs as well as target accuracies for each block by participant as in Experiment 1. Descriptive statistics for these measures are shown in Table 8.
We analyzed RTs through the five blocks with a Friedman ANOVA (as the assumption of normal distribution of the residuals in the parametric ANOVA was violated) and found a significant effect of Block, χ²(4) = 116.61, p < .001. Two participants were excluded from this analysis as they gave no responses in Block TRN1. For the post hoc Wilcoxon signed rank tests we applied Holm-Bonferroni sequential corrections because of the multiple comparisons. These pairwise comparisons showed that there was a significant decrease in the first three training blocks (Block TRN1 > Block TRN2: Z = −5.20, p < .001, r = 0.66; Block TRN2 > Block TRN3: Z = −3.16, p = .002, r = 0.40). RTs in the random block were also significantly different from RTs in the neighboring word blocks (Block TRN3 < Block RND: Z = −6.50, p < .001, r = 0.83; Block RND > Block REC: Z = −6.41, p < .001, r = 0.81). Changes in median RTs are illustrated in Fig. 4.
For analyzing the learning process in accuracy changes in target detection, we also used a Friedman ANOVA and found a significant effect of Block, χ²(4) = 36.48, p < .001. Post hoc Wilcoxon signed rank tests with Holm-Bonferroni corrections for significance levels showed that through the first three blocks, there was a significant increase in accuracies for targets between the blocks TRN1 and TRN2, but not between blocks TRN2 and TRN3 (Block TRN1 < Block TRN2: Z = −2.26, p = .048, r = 0.28; Block TRN2 < Block TRN3: Z = −1.65, p = .099, r = 0.21). Accuracies in the random block were significantly lower than in the neighboring word blocks (Block TRN3 > Block RND: Z = −4.15, p < .001, r = 0.52; Block RND < Block REC: Z = −4.81, p < .001, r = 0.60). Changes in target accuracies through blocks are illustrated in Fig. 5.
3.2.2. Two-alternative forced choice (2AFC) task
In the 2AFC task, as in Experiment 1, we tested three contrasts: 1) word vs. part-word, 2) word vs. non-word, and 3) part-word vs. non-word sequences. For each contrast, a score was calculated by dividing the number of correct answers by the number of all trials. The overall measure was calculated by averaging accuracy rates in the three contrasts. These measures are detailed in Table 9. We found a significant difference from the chance level of 0.5 in the case of all 2AFC measures.
3.2.3. Relationships between online and offline measures
As in Experiment 1, we assessed the reliability of Experiment 2 measures. The method of reliability estimation was identical to that of Experiment 1: we calculated Spearman-Brown reliability coefficients of split-half correlations in 5000 random samples for each measure. Mean reliability coefficients were 0.74 for RT training (2.5th percentile: 0.62; 97.5th percentile: 0.83), 0.86 for RT difference (2.5th percentile: 0.80; 97.5th percentile: 0.91), and 0.68 for learning efficiency (2.5th percentile: 0.53; 97.5th percentile: 0.80). Reliability of 2AFC overall was 0.37 (2.5th percentile: 0.13; 97.5th percentile: 0.57).
We used the RT difference and the 2AFC overall measures for assessing the relationship between the task scores, as these measures were correlated in Experiment 1. The correlation between these measures was significant, but moderate (rs = 0.32, p = .009). With the purpose of testing its adequacy, we also examined the relationship of learning efficiency to the 2AFC overall measure, and here too, there was a weak, but
Table 8
Descriptive statistics of median RTs and accuracies by block in the online target detection task.
Block        Block type   RT median   RT range        ACC median   ACC range
Block TRN1   Training     412.50      198.00–780.50   0.67         0.00–1.00
Block RND    Random       597.50      364.00–762.00   0.60         0.27–0.93
Block REC    Recovery     310.00      53.00–698.00    0.73         0.27–0.93
Fig. 4. The distribution of median RTs by block. Boxes indicate RT data between the first and third quartiles, vertical lines indicate medians, and whiskers show the range of data in each group. (Y-axis: median RT, 0 to 1000. Significance markers shown in the plot: ***, **, ***, ***.)
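The Holm-Bonferroni sequential correction applied to the post hoc comparisons can be sketched as below. This is a generic illustration of the step-down procedure, not the authors' code: the function name, the alpha level of 0.05 and the example p-values are assumptions, and the text does not say whether the p-values it reports are raw or already adjusted, so the reported values are not reused here.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # this and every larger p-value are retained
    return rejected


# Four hypothetical pairwise comparisons between neighboring blocks.
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

With these hypothetical inputs, the first and last comparisons are rejected (0.005 ≤ 0.05/4 and 0.01 ≤ 0.05/3), and the procedure stops at 0.03, which exceeds 0.05/2.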
significant relationship (rs = 0.26, p = .037).
Fig. 5. The distribution of accuracies by block. Boxes indicate accuracy data between the first and third quartiles, vertical lines indicate medians, and whiskers show the range of data in each group. (Y-axis: accuracy, 0.00 to 1.00. Significance markers shown in the plot: *, n.s., ***, ***.)
Table 9
Accuracy rates and the differences from chance level in the case of 2AFC measures.
Type                           Median   Range        Z value    p (x > 0.5)   Effect size
1. 2AFC overall                0.71     0.42–0.92    Z = 6.81   p < .001      r = 0.85
2. 2AFC word > part-word       0.75     0.25–1.00    Z = 6.20   p < .001      r = 0.78
3. 2AFC word > non-word        0.88     0.50–1.00    Z = 6.75   p < .001      r = 0.84
4. 2AFC part-word > non-word   0.63     0.13–1.00    Z = 2.88   p = .004      r = 0.36
Note. Differences from chance level of 0.5 were tested using Wilcoxon signed rank tests, as the measures were not normally distributed.
To see whether developing explicit representations of words contributed to correlations between the online and offline measures, we divided participants along two dimensions derived from the answers they gave in the debriefing questionnaire. Responses were scored by two independent raters. The first dimension reflected whether they noticed any local pattern in the stimuli, that is, whether they had any knowledge about what syllable or set of syllables predicted their target syllable. If a participant gave responses to Questions 1–4 implying knowledge of a local pattern, i.e., what preceded their target syllable, they got an overall score of 1 in the local pattern dimension. These responses included sentences like “My target syllable occurred after the syllable ‘go’.”, “I knew which syllable preceded my target.”, or “‘Go’ was always followed by ‘ki’.”. If the answers gave no indication of local pattern knowledge, they got an overall score of 0. If their responses were ambiguous across Questions 1–4, they were not scored and were excluded from analyses investigating the effect of local pattern knowledge. The second dimension reflected whether they noticed a global pattern, that is, whether they knew that the stimuli consisted of words. If a participant gave responses to Questions 1–4 implying knowledge of the global pattern, i.e., that the streams consisted of words, they got a score of 1 in the global pattern dimension. These responses included sentences like “There were words in the speech stream.”, “Syllables could be combined.”, or “There were patterns in the speech stream.”. If the answers gave no indication of global pattern knowledge, they got a score of 0. If their responses were ambiguous regarding global pattern knowledge, they were not scored and were excluded from analyses exploring this effect.
Comparing performance of groups of participants divided by knowledge of local patterns, we found a significant difference between the two groups in the RT difference scores, t(59) = −4.68, p < .001, r = 0.52, and in the learning efficiency scores, Z = −2.84, p = .005, r = 0.36, but no significant differences in 2AFC overall scores, Z = −1.54, p = .123, r = 0.20. Descriptive statistics of groups are reported in Table 10. The association between online and offline measures was not significant in the case of those who knew what predicted their target syllables (RT difference and 2AFC overall: rs = 0.06, p = .744; learning efficiency and 2AFC overall: rs = 0.04, p = .810), but there was a significant relationship between online and offline test scores in the case of participants who did not have explicit knowledge about what syllables or syllable combinations predicted their targets (RT difference and 2AFC overall: rs = 0.50, p = .013; learning efficiency and 2AFC overall: rs = 0.55, p = .006). The difference between the strength of associations in the presence and absence of local pattern knowledge was not statistically significant either in the case of RT difference and 2AFC overall scores (Z = 1.51, p = .132), or in the case of learning efficiency and 2AFC overall scores (Z = 1.50, p = .134).
A similar pattern was observed when participants were grouped by knowledge of the global pattern. There was a trend-like difference between the groups in RT difference scores, Z = −1.76, p = .079, r = 0.23, but there was no difference for learning efficiency scores, Z = −1.33, p = .185, r = 0.17, and 2AFC overall scores, Z = −0.44, p = .662, r = 0.06. Descriptive statistics of groups are shown in Table 11. The relationship between online and offline measures was not significant in those who noticed that the stream consisted of words (RT difference and 2AFC overall: rs = 0.14, p = .550, learning efficiency and 2AFC overall: rs = −0.15, p = .507), but it was significant in those who did not have explicit knowledge about the pattern (RT difference and 2AFC overall: rs = 0.38, p = .018, learning efficiency and 2AFC overall: rs = 0.55, p < .001). The difference between these correlations in the presence and absence of global knowledge was not statistically significant in the case of RT difference and 2AFC overall scores (Z = 1.56, p = .118), but the correlations of learning efficiency and 2AFC overall scores in the presence and absence of global pattern knowledge differed significantly (Z = 2.95, p = .003). The relationship of online and offline measures in samples divided along aspects of recognizing local and global patterns is illustrated in Figs. 6 and 7.
3.3. Discussion
Experiment 2 was designed to examine the effect of forming explicit representations on the relationship between measures of the online target detection task and the 2AFC task. We replicated findings of Experiment 1, as we found a positive relationship between the online target detection measures (RT difference and learning efficiency) and the 2AFC task measures. Contrary to our assumption, though, this relationship was only present in the absence of explicit knowledge about local or global patterns. We found no evidence of an association between measures in the groups of participants who had explicit knowledge about what predicted their target syllable in the online task, or about the
Table 10
Descriptive statistics of RT difference, learning efficiency and 2AFC overall scores grouped by knowledge of local patterns.
                         Knowledge of local patterns
                         No                         Yes
Measure                  Median   Range             Median   Range
1. RT difference         162.50   −315.25–556.50    402.50   9.00–593.50
2. Learning efficiency   0.05     −0.12–0.32        0.15     0.01–0.30
3. 2AFC overall          0.69     0.42–0.83         0.71     0.50–0.92
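The comparisons of correlation strength between groups reported above (for example Z = 1.51, p = .132) are consistent with a test of two independent correlations. The text does not name the test, so the sketch below assumes Fisher's r-to-z transformation with a two-sided normal p-value; applying it to Spearman coefficients is itself an approximation. The function names and the group sizes in the example call are hypothetical, since group sizes are not given in this section.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def compare_independent_correlations(r1, n1, r2, n2):
    """z statistic and two-sided p for H0: the two population correlations
    are equal (assumed method: Fisher r-to-z, independent samples)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    return z, two_sided_p(z)


# The reported Z values map onto the reported p-values under the normal
# approximation (up to rounding of Z in the text).
print(round(two_sided_p(1.51), 3))
print(round(two_sided_p(2.95), 3))

# Correlations from the text with HYPOTHETICAL group sizes.
z, p = compare_independent_correlations(0.50, 25, 0.06, 35)
print(z, p)
```

Because the group sizes are placeholders, the last call illustrates the mechanics only and is not expected to reproduce the reported Z values.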
Table 11
Descriptive statistics of RT difference, learning efficiency and 2AFC overall scores grouped by knowledge of global patterns.
                         Knowledge of global pattern
                         No                         Yes
Measure                  Median   Range             Median   Range
1. RT difference         362.75   −110.00–539.50    265.00   45.75–564.00
2. Learning efficiency   0.13     −0.05–0.32        0.05     0.00–0.30
3. 2AFC overall          0.71     0.42–0.92         0.75     0.50–0.88
fact that the streams consisted of words. This suggests that developing explicit representations about the stimuli was not the factor explaining a significant relationship between the online target detection and the 2AFC task measures, and implies that implicit extraction of structure from sound patterns can account for the positive relationship found in Experiments 1 and 2.
Furthermore, our online measures showed good split-half reliability values, especially the RT difference score. On the other hand, we observed weak internal consistency for 2AFC task scores, replicating earlier findings (Arnon, 2019; Siegelman, Bogaerts, Elazar, et al., 2018).
4. General discussion
In this paper, a novel online target detection task was introduced to test statistical word segmentation abilities. Most SL tasks measure learning only at a given point, which prevents gaining insight into the process of learning throughout the task. The present task was designed to be suitable for tracking learning online through reaction times and accuracies. The aim of Experiment 1 was to test the sensitivity of online measures from this task to SL processes, and to look at their relationship with measures from a 2AFC task and the SICR task by Isbilen et al. (2017). Recently, several authors in the SL literature raised concerns that 2AFC might be influenced by other cognitive processes, like deliberation and decision-making (e.g., Arnon, 2019; Christiansen, 2018; Isbilen et al., 2017), and therefore it may not be a suitable measure for SL. The online target-detection and the SICR tasks are processing-based measures (Christiansen, 2018), which have the advantage of not requiring reflection on the acquired representations; therefore, they are both good candidates to be suitable measures for SL.
Fig. 6. The relationship between the 2AFC overall and RT difference measures in groups of participants A) who had explicit knowledge of global patterns, B) who did not have any explicit knowledge of global patterns, C) who had explicit knowledge about local patterns, D) who did not have any explicit knowledge of local patterns in the experimental stimuli. For illustrative purposes, we fitted a regression line to the data, and a minimal amount of jitter was added to increase visibility.
Fig. 7. The relationship between the 2AFC overall and learning efficiency measures in groups of participants A) who had explicit knowledge of global patterns, B) who did not have any explicit knowledge of global patterns, C) who had explicit knowledge about local patterns, D) who did not have any explicit knowledge of local patterns in the experimental stimuli. For illustrative purposes, we fitted a regression line to the data, and a minimal amount of jitter was added to increase visibility.
In Experiment 1, we found that the online target detection measure was sensitive to learning: RTs decreased and accuracies increased through the three training blocks with streams of words, and the disruption of structure in the random block resulted in a significant increase in reaction times and decrease of accuracies, relative to both the previous and next word blocks. These results, the decrease in RTs and increase in accuracies, show that the prediction of targets becomes better over time and likely reflect participants’ increasing sensitivity to the statistical structure of the stream. The presence of such a learning effect makes the online target detection task a good candidate for a more targeted SL measure. This is also supported by reliability estimates of task indices: RT training and RT difference both had high internal consistencies across the two experiments. We also introduced and verified the adequacy of the learning efficiency measure. In contrast to the other two online measures based on reaction time differences between blocks, this measure is sensitive to the dynamics of learning. In both experiments, learning efficiency showed similar associations with offline measures as the RT difference online score. While in our study, introducing learning efficiency did not extend our findings, we believe it can be a promising tool in atypical populations, where the dynamics of SL can be more affected and variable (e.g., Developmental Language Disorder, Developmental Dyslexia, etc.) than what we see in typical performance. In these cases, this measure of the dynamics of learning can be an additional tool for capturing the nature of differences between populations. However, despite being a promising measure, its relatively low reliability needs further attention.
Despite the promising results, our paradigm is not without limitations. In order to create an online target detection task which is capable of tracking learning from the beginning and also feasible for participants who are exposed to the stimuli for the first time, we used a slow presentation rate. This rate of 500 ms (with 400 ms long syllables and 100 ms long pauses) diverges from the timing of natural language, which is a limitation of this task. We are currently working on a version with a faster presentation rate (with 270 ms syllable duration and 30 ms pauses) to test whether the presentation rate of syllables has an effect on learning.
As one reviewer pointed out, responses to last syllables could serve as replicated in Experiment 2). This result is similar to the findings of
cues to word boundaries, thus enhancing performance in the 2AFC task: Siegelman and his colleagues (Siegelman, Bogaerts, Kronenfeld, & Frost,
that is, correct motor responses reliably coincided with word bound 2018), who tracked learning online utilizing a self-paced learning task in
aries. Indeed, results from the statistical learning literature show that a visual nonverbal segmentation paradigm. They found a positive as
input from other than the primary input modality can contribute to sociation between online scores derived from the self-paced learning
learning. In a word segmentation study with four-month-old infants phase and 2AFC scores within the same task. On the other hand, we
(Seidl, Tincoff, Baker, & Cristia, 2015), babies were more successful in found no association between the RT training scores and 2AFC scores.
extracting a word from a speech stream if their knees or elbows were Earlier studies examining the relationship between the 2AFC task
touched by the experimenter while they were presented with the word in the and target detection tasks found partly different patterns of the associ
stream. However, the present online target detection paradigm was ations. Franco et al. (2015), and Batterink et al. (2015) found no cor
different from that of Seidl et al. (2015) in two important aspects. First, relation between a 2AFC task and their post-training target detection
in our task, participants’ motor responses were self-initiated, not measure, while Batterink and Paller (2017) found a positive relationship
external signals, so they might be less likely to have a cue value. Second, between a familiarity rating task and their item-detection task (they did
infants got a continuous tactile input during the target word. Moreover, not analyze associations between their target detection and 2AFC tasks).
self-initiated motor responses are not reliable cues in natural languages However, there are significant differences between these earlier studies
like, for instance, changes in prosody (e.g., Kabak, Maniwa, & Kazanina, and the current one. The target detection task in the present study
2010; Langus, Marchetto, Bion, & Nespor, 2012; Morgan, Meier, & differed from the detection tasks in previous studies in three important
Newport, 1987). In sum, while it cannot be excluded that motor re aspects, which could all contribute to the different patterns we got: 1)
sponses served as cues to word boundaries, this does not necessarily we were monitoring responses to syllables during training, in contrast
undermine the validity of learning based on transitional probabilities with the post-familiarization tasks utilized by the other studies; 2) there
between syllables. was only one target syllable, in contrast with the other studies, where
As in the current online target detection task the instruction was to target syllables were alternating during the task; 3) we measured
respond to the last syllable of a word, a further possible criticism is that learning by comparing RT performance of syllables in the last position of
instead of gaining information about the extraction of words, we only words versus syllables in a pseudorandom order, while earlier studies
gathered evidence about the acquisition of a given syllable pair (the contrasted performance for syllables in different positions within words,
target bigram). In this paradigm, words are defined as a sequence of two yielding varying predictability for target syllables in 1st, and 2nd and
syllable pairs with higher transitional probabilities bounded by syllable 3rd positions (that is, the first syllables of words are still predictable to
pairs with lower transitional probabilities, so detecting “words” reflects some extent, making the pseudorandom stream a more adequate con
learning of sequences of syllable pairs. Different experimental designs dition for measuring processing of unpredictable syllables).
can provide more information about the entire syllable triplet forming a We hypothesized that forming explicit representations about words
word. One possibility is to have an alternating target in the online task, could be a mediating factor in the relationship between the online and
so that reaction time and accuracy data can be collected about all syl 2AFC measures and we conducted Experiment 2 to examine this possi
lables during the learning process. Another possible solution is designing bility. The results showed that the association between scores on the two
a post hoc 2AFC task in which the items also test whether participants tasks is only observed in participants without explicit knowledge of
learned other transitional probabilities than the one involving the target structure in the stream. This suggests that this relationship is not likely
syllable, for instance, by including word-foil contrasts where both se to be the result of forming explicit representations enhanced by the
quences include the target bigram, so that familiarity decisions cannot relative transparency of the stimulus set, and the online target detection
be based on the presence of this bigram. task and the 2AFC task might tap into similar SL processes. One possi
To sum up, the online target detection paradigm is a promising task bility is that they both reflect the operation of a single mechanism, with
for measuring SL, and additional studies may further verify and improve the small amount of shared variation caused by methodological con
its validity. This should be complemented by testing further its psy founds in the tasks (e.g., by psychometric and methodological short
chometric properties essential for assessing individual differences comings of the 2AFC task). Another possibility is that multiple processes
effectively (Siegelman et al., 2017). are at work during SL and they affect performance of the two tasks to a
We found evidence of learning in the 2AFC task as well, similarly to different extent so that they do not share a large variance. These mul
many previous studies. For this measure, similarly to earlier studies, we tiple processes might be different mechanisms which drive learning, as
did not find high internal consistency in either of the two experiments acquisition of transitional probabilities (as suggested in the original
(Arnon, 2019; Siegelman, Bogaerts, Elazar, et al., 2018). For the SICR studies by Saffran, Aslin, & Newport, 1996; and Saffran, Newport, &
task, we replicated the findings of Isbilen et al. (2017): participants were Aslin, 1996) and chunk-formation (as proposed by e.g., Perruchet,
more successful in recalling sequences of words than sequences of non- 2018), or different stages of processing, like encoding of stimuli, pattern
words, and they also performed above chance level in the case of our detection, retention and retrieval (Bogaerts, Siegelman, & Frost, 2016;
overall learning score, indicating a significant learning effect. However, Frost et al., 2015). Further studies should shed light on this question.
reliability of task scores was very weak. The analysis of the relationship between 2AFC and SICR tasks
To see how measures from the new target detection task relate to replicated the findings of Isbilen et al. (2017): as no relationship was
other SL measures, we tested associations between measures from all observed between 2AFC and SICR task measures, these two measures
three tasks. We expected that the online task would have a positive seem to be independent. Moreover, contrary to our hypothesis, there
relationship with the SICR task, and a more modest, or insignificant was no positive association between the online target detection and
relationship with the 2AFC task. We also expected to see only a weak or SICR measures. A possible explanation is that SICR taps into a different
no relationship between the SICR and 2AFC tasks. Analyses of re aspect or mechanism of SL. Another plausible cause of this result is that
lationships between the measures only partly supported our hypotheses. both measures were considerably noisy, which could hide a potential
We found a positive correlation between the online target detection and relationship between the tasks. An additional possible source of this
2AFC measures: those who showed a bigger online learning effect in the pattern of results is that SICR is a production-based measure. As Isbilen
RT difference and learning efficiency scores were also more efficient in the et al. (2017) noted, “unlike 2AFC and reaction time tasks, SICR requires
2AFC task and had higher 2AFC overall scores (and this finding was both immediate comprehension and production on the part of the
learner” (p. 568). Both 2AFC and target detection paradigms test SL more from the side of perception: participants have to perform operations on perceived sequences of items. In the SICR task, SL is measured from the side of production: participants have to articulate sequences of items. That is, in our case, the item detection and 2AFC tasks address the question “How does SL ability affect processing of incoming sequences?”, while the question for SICR can be formulated more like “How does SL ability affect processing of incoming stimuli and production of these represented sequences?”. Speech perception and production are different processes. For instance, data from patients with speech deficits suggest that verbal short-term memory may not be unitary, with separate input and output buffers in operation (e.g., Howard & Nickels, 2005; Martin, Lesch, & Bartha, 1999; see Jacquemot & Scott, 2006, for a discussion). This is especially important in this case, as SICR is essentially a verbal short-term memory task. Consequently, the effect of SL on short-term memory may not be unitary: it can influence the input side, perception, as well as the output side, production, possibly in a distinct manner. This is an aspect that needs further investigation, for instance, through measuring the effect of stimulus structure on processing and not production-based short-term memory measures (following the methods of digit span tasks which measure verbal short-term memory by pointing to elements of sequences or matching of two sequences), and including production-based measures for other task types too.
As in the case of earlier studies, we did not find very strong relationships between different measures of SL, which can be explained partly by their psychometric properties and noisiness, but may still raise concerns regarding their validity. Studies looking for new approaches in SL measurement often build on the implicit notion that a single core mechanism is at work during SL (as described for the relationship between priming and recognition measures in Shanks & Perruchet, 2002, or different SL paradigms, Perruchet & Pacton, 2006), hence the purpose is to find a method which measures it with the greatest accuracy. Although eliminating methodological shortcomings is an important effort, such shortcomings are not necessarily the only source of variation between different measures. If one assumes multiple mechanisms behind SL (e.g., learning TPs or chunk-formation), or takes the entire process of SL into consideration (e.g., encoding, pattern detection, retention, retrieval), it is reasonable to assume that different tasks are sensitive to different SL mechanisms or processes. That is, the goal might rather be finding multiple accurate measures, which together could better describe a person’s SL ability that shapes behavior than scores obtained on a single task (like in the case of executive functions, Miyake & Friedman, 2012). Further work is needed on different SL tasks testing less homogeneous populations, and systematically assessing and improving their psychometric properties (see e.g., Siegelman et al., 2017, for a summary of shortcomings in SL testing and potential solutions).

5. Conclusion

Statistical learning contributes to the acquisition of many knowledge types and skills. Despite its crucial role in several domains of cognition, there is no consensus about what tasks are appropriate for measuring it. The widely used judgement and 2AFC measures are criticized for their psychometric weaknesses, making them less favorable for assessing either group-level effects or individual differences. As part of the efforts to find more suitable measurements of statistical learning, we introduced a novel online target detection paradigm for statistical word segmentation offering measures that do not build on deliberation and decision making. We found that our new task is suitable for measuring statistical learning effects, provides an opportunity to track learning online, and has favorable measures of reliability. This makes our task a good candidate for investigating group-level effects, as well as individual differences.
We also examined the relationship between online learning and two other measures from a statistically-induced chunking recall and a two-alternative forced choice task. Performance levels were not correlated between all three tasks. The online target detection measure and the two-alternative forced choice task were positively associated, and this relationship was only observed when participants did not form explicit representations about stimuli. Scores of the statistically-induced chunking recall task were not correlated with performance on the other two tasks. This pattern of findings might reflect multiple statistical learning processes or be a product of low reliability and noisiness of the measures. We hope that our study contributes to the quest to find suitable tests for assessing statistical learning, and inspires future studies to systematically assess the psychometric properties of different measures.

Declaration of competing interest

None.

Acknowledgements

This work was supported by the Momentum Research Grant of the Hungarian Academy of Sciences (Momentum 96233 ‘Profiling learning mechanisms and learners: individual differences from impairments to excellence in statistical learning and in language acquisition’, PI: Ágnes Lukács). We thank the volunteers and students for participating in the experiments. We are also grateful to Dorottya Dobó for her help in the design of the study, data collection and for her comments on the manuscript; to Fruzsina Krizsai for her help in data processing; and we would like to thank Bertalan Polner and Kornél Németh for their advice on the statistical analysis.

Appendix A. Description of the learning efficiency measure
The learning efficiency measure is defined as the product of three scores:

learning efficiency = learning speed × (RND−TRN3 RT difference) × (TRN3 RT spread)
Learning speed describes how rapidly a given participant reached their “end state” of RT decrease within a given range on the trial-by-trial level. This “end state” was operationalized as the median RT of the last training block (Block TRN3). For each participant, we created 36 slices of ten consecutive targets out of the 45 targets in their training blocks (e.g., targets 1 to 10, targets 2 to 11, etc.). We calculated median RTs for each slice (note that each slice could include missing responses). Then we took the slice whose median RT was smaller than or equal to the median of Block TRN3 plus 75 ms,² and after which none of the slices had a median RT greater than this value. We then subtracted the number of the start of the slice from the number of all possible slices (e.g., if it was the slice 19 to 28, we subtracted 19 from 36, getting 17), thus getting the number of remaining slices, where median RTs are already equal to or below the critical value. As the maximum of remaining slices is 35 (e.g., when a participant reaches the critical value immediately with the first slice), we divided the number of remaining slices by 35. As a result, we got a number between 0 and 1.
² Note that we used 75 ms as a unitary range instead of using individual reaction time spreads (e.g., interquartile range), as the spreads in Block TRN3 greatly varied across participants. That is, had we used individual spreads, a participant with a larger reaction time spread in Block TRN3 would reach their “end state” of learning sooner than a participant with the same median reaction times but smaller spread.
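The learning speed computation described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' analysis code: the function name and arguments are hypothetical, missing responses are assumed to be coded as `None` and simply left out of each slice median, and a participant whose final slice never reaches the critical value is assumed to score 0.

```python
from statistics import median


def learning_speed(training_rts, trn3_rts, slice_len=10, tolerance_ms=75.0):
    """Return a 0..1 learning speed score from trial-by-trial RTs (in ms).

    training_rts: RTs for all training targets in order (45 in the paper),
                  with None marking a missing response (assumed coding).
    trn3_rts:     RTs from the last training block (Block TRN3).
    """
    # Critical value: median RT of Block TRN3 plus the fixed 75 ms range.
    trn3_valid = [rt for rt in trn3_rts if rt is not None]
    criterion = median(trn3_valid) + tolerance_ms

    # Overlapping slices: targets 1-10, 2-11, ... (36 slices for 45 targets).
    n_slices = len(training_rts) - slice_len + 1
    slice_medians = []
    for start in range(n_slices):
        valid = [rt for rt in training_rts[start:start + slice_len]
                 if rt is not None]
        # Assumption: a slice with no valid response never meets the criterion.
        slice_medians.append(median(valid) if valid else float("inf"))

    # Walk backwards to find the earliest slice after which no slice
    # exceeds the critical value.
    first_stable = None  # 1-based slice number
    for idx in range(n_slices - 1, -1, -1):
        if slice_medians[idx] <= criterion:
            first_stable = idx + 1
        else:
            break

    if first_stable is None:
        return 0.0  # assumption: the end state was never reached
    remaining = n_slices - first_stable
    return remaining / (n_slices - 1)  # at most 35 remaining slices
```

With 22 slow responses followed by 23 fast ones (for instance `[600] * 22 + [400] * 23`, using the last 15 values as Block TRN3), the first stable slice is slice 19, which reproduces the worked example in the text: 36 minus 19 gives 17 remaining slices, and the score is 17/35.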
The RND-TRN3 difference score is calculated from the difference between the random block (Block RND) and the last training block (Block TRN3): the median RT of Block TRN3 is subtracted from the median RT of Block RND. As the time window was 1200 ms in the target detection task, and thus the largest possible difference between the two blocks is 1200 ms, we divided this number by 1200. As a result, we got a number that could vary between −1 and 1 (where negative values mean that RTs were smaller in Block RND). We included this score as it can indicate a learning effect irrespective of the presence or absence of the RT decrease through the training blocks.
The TRN3 RT spread score is calculated from the interquartile range of RTs in Block TRN3. As the time window was 1200 ms, and thus the largest possible interquartile range was also 1200 ms, we divided the TRN3 RT interquartile range by 1200, and subtracted this number from 1. As the variability of this number was relatively small among participants (between 0.69 and 0.98), we took its square to magnify its effect. As a result, we got a number that could vary between 0 and 1. We included this score as we hypothesized that stable and less scattered RTs in Block TRN3 indicate more successful learning through the training blocks.
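As a rough illustration of how the three scores combine, the following Python sketch computes the RND-TRN3 difference score, the TRN3 RT spread score, and their product with a learning speed value supplied by the caller. The function names are hypothetical, missing responses are assumed to have been removed beforehand, and the quartile method (`method="inclusive"`) is an assumption, since the appendix does not state how the interquartile range was computed.

```python
from statistics import median, quantiles

WINDOW_MS = 1200.0  # response time window stated in the appendix


def rt_difference_score(rnd_rts, trn3_rts, window_ms=WINDOW_MS):
    """Median RT of Block RND minus median RT of Block TRN3, scaled to -1..1."""
    return (median(rnd_rts) - median(trn3_rts)) / window_ms


def rt_spread_score(trn3_rts, window_ms=WINDOW_MS):
    """Squared (1 - IQR / window): values near 1 mean tightly clustered RTs."""
    # Assumption: inclusive quartiles; the paper does not specify the method.
    q1, _, q3 = quantiles(trn3_rts, n=4, method="inclusive")
    return (1.0 - (q3 - q1) / window_ms) ** 2


def learning_efficiency(speed, rnd_rts, trn3_rts):
    """Product of learning speed, RT difference score, and RT spread score."""
    return (speed
            * rt_difference_score(rnd_rts, trn3_rts)
            * rt_spread_score(trn3_rts))


# Hypothetical data for one participant (RTs in ms).
rnd_block = [500, 520, 540]
trn3_block = [380, 400, 420, 440, 460]
efficiency = learning_efficiency(17 / 35, rnd_block, trn3_block)
```

Because the difference score can be negative, the product is negative whenever RTs in the random block were faster than in the last training block, regardless of the other two scores.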
Appendix B. Questions in the debriefing session translated to English
1. What do you think, what was the aim of the experiment?
2. In the first part of the experiment (when you had to respond to one syllable), did you notice any difference between the blocks? Did you notice any difference between the 4th (the penultimate) and the other blocks?
3. In the first part of the experiment there was a pattern in most of the blocks. Did you notice it? If so, can you explain it?
4. This pattern was that most of the blocks consisted of four trisyllabic words. Can you recall them?
5. What do you think, what was the aim of the second part of the experiment (when you had to choose the more familiar one between two trisyllabic
sequences)?
6. Is there anything else you would like to mention related to the experiment?
References
Arciuli, J., Torkildsen, J. v. K., Stevens, D. J., & Simpson, I. C. (2014). Statistical learning under incidental versus intentional conditions. Frontiers in Psychology, 5. https://doi.org/10.3389/fpsyg.2014.00747
Arnon, I. (2019). Do current statistical learning tasks capture stable individual differences in children? An investigation of task reliability across modality. Behavior Research Methods. https://doi.org/10.3758/s13428-019-01205-5
Batterink, L. J. (2017). Rapid statistical learning supporting word extraction from continuous speech. Psychological Science, 28(7), 921–928. https://doi.org/10.1177/0956797617698226
Batterink, L. J., & Paller, K. A. (2017). Online neural monitoring of statistical learning. Cortex, 90, 31–45. https://doi.org/10.1016/j.cortex.2017.02.004
Batterink, L. J., Reber, P. J., Neville, H. J., & Paller, K. A. (2015). Implicit and explicit contributions to statistical learning. Journal of Memory and Language, 83, 62–78. https://doi.org/10.1016/j.jml.2015.04.004
Bogaerts, L., Siegelman, N., & Frost, R. (2016). Splitting the variance of statistical learning performance: A parametric investigation of exposure duration and transitional probabilities. Psychonomic Bulletin & Review, 23(4), 1250–1256. https://doi.org/10.3758/s13423-015-0996-z
Christiansen, M. H. (2018). Implicit statistical learning: A tale of two literatures. Topics in Cognitive Science. https://doi.org/10.1111/tops.12332
Conway, C. M., Bauernschmidt, A., Huang, S. S., & Pisoni, D. B. (2010). Implicit statistical learning in language processing: Word predictability is the key. Cognition, 114, 356–371. https://doi.org/10.1016/j.cognition.2009.10.009
Franco, A., Eberlen, J., Destrebecqz, A., Cleeremans, A., & Bertels, J. (2015). Rapid serial auditory presentation: A new measure of statistical learning in speech segmentation. Experimental Psychology, 62(5), 346–351. https://doi.org/10.1027/1618-3169/a000295
Frost, R., Armstrong, B. C., Siegelman, N., & Christiansen, M. H. (2015). Domain generality versus modality specificity: The paradox of statistical learning. Trends in Cognitive Sciences, 19(3), 117–125. https://doi.org/10.1016/j.tics.2014.12.010
Gaetano, J. (2018). Holm-Bonferroni sequential correction: An Excel calculator (1.3) [Microsoft Excel workbook]. Retrieved from https://www.researchgate.net/publication/322568540_Holm-Bonferroni_sequential_correction_An_Excel_calculator_13.
Howard, D., & Nickels, L. (2005). Separating input and output phonology: Semantic, phonological, and orthographic effects in short-term memory impairment. Cognitive Neuropsychology, 22(1), 42–77. https://doi.org/10.1080/02643290342000582
Hunt, R. H., & Aslin, R. N. (2001). Statistical learning in a serial reaction time task: Access to separable statistical cues by individual learners. Journal of Experimental Psychology: General, 130(4), 658.
Isbilen, E. S., McCauley, S. M., Kidd, E., & Christiansen, M. H. (2017). In Testing statistical learning implicitly: A novel chunk-based measure of statistical learning (pp. 564–569).
Jacquemot, C., & Scott, S. K. (2006). What is the relationship between phonological short-term memory and speech processing? Trends in Cognitive Sciences, 10(11), 480–486. https://doi.org/10.1016/j.tics.2006.09.002
Kabak, B., Maniwa, K., & Kazanina, N. (2010). Listeners use vowel harmony and word-final stress to spot nonsense words: A study of Turkish and French. Laboratory Phonology, 1(1), 207–224.
Kaufman, S. B., DeYoung, C. G., Gray, J. R., Jiménez, L., Brown, J., & Mackintosh, N. (2010). Implicit learning as an ability. Cognition, 116(3), 321–340. https://doi.org/10.1016/j.cognition.2010.05.011
Kidd, E. (2012). Implicit statistical learning is directly associated with the acquisition of syntax. Developmental Psychology, 48(1), 171–184. https://doi.org/10.1037/a0025405
Langus, A., Marchetto, E., Bion, R. A. H., & Nespor, M. (2012). Can prosody be used to discover hierarchical structure in continuous speech? Journal of Memory and Language, 66(1), 285–306.
Martin, R. C., Lesch, M. F., & Bartha, M. C. (1999). Independence of input and output phonology in word processing and short-term memory. Journal of Memory and Language, 41(1), 3–29. https://doi.org/10.1006/jmla.1999.2637
Misyak, J. B., & Christiansen, M. H. (2012). Statistical learning and language: An individual differences study: Individual differences in statistical learning. Language Learning, 62(1), 302–331. https://doi.org/10.1111/j.1467-9922.2010.00626.x
Miyake, A., & Friedman, N. P. (2012). The nature and organization of individual differences in executive functions: Four general conclusions. Current Directions in Psychological Science, 21(1), 8–14. https://doi.org/10.1177/0963721411429458
Morgan, J. L., Meier, R. P., & Newport, E. L. (1987). Structural packaging in the input to language learning: Contributions of prosodic and morphological marking of phrases to the acquisition of language. Cognitive Psychology, 19(4), 498–550.
Nissen, M. J., & Bullemer, P. (1987). Attentional requirements of learning: Evidence from performance measures. Cognitive Psychology, 19(1), 1–32. https://doi.org/10.1016/0010-0285(87)90002-8
Perruchet, P. (2018). What mechanisms underlie implicit statistical learning? Transitional probabilities versus chunks in language learning. Topics in Cognitive Science. https://doi.org/10.1111/tops.12403
Perruchet, P., & Pacton, S. (2006). Implicit learning and statistical learning: One phenomenon, two approaches. Trends in Cognitive Sciences, 10(5), 233–238. https://doi.org/10.1016/j.tics.2006.03.006
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926–1928. https://doi.org/10.1126/science.274.5294.1926
Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language, 35(4), 606–621. https://doi.org/10.1006/jmla.1996.0032
Seidl, A., Tincoff, R., Baker, C., & Cristia, A. (2015). Why the body comes first: Effects of experimenter touch on infants’ word finding. Developmental Science, 18(1), 155–164.
Shanks, D. R., & Perruchet, P. (2002). Dissociation between priming and recognition in the expression of sequential knowledge. Psychonomic Bulletin & Review, 9(2), 362–367. https://doi.org/10.3758/BF03196294
Siegelman, N., Bogaerts, L., Elazar, A., Arciuli, J., & Frost, R. (2018). Linguistic entrenchment: Prior knowledge impacts statistical learning performance. Cognition, 177, 198–213.
Siegelman, N., Bogaerts, L., & Frost, R. (2017). Measuring individual differences in statistical learning: Current pitfalls and possible solutions. Behavior Research Methods, 49(2), 418–432. https://doi.org/10.3758/s13428-016-0719-z
Siegelman, N., Bogaerts, L., Kronenfeld, O., & Frost, R. (2018). Redefining “learning” in statistical learning: What does an online measure reveal about the assimilation of visual regularities? Cognitive Science, 42, 692–727.
Siegelman, N., & Frost, R. (2015). Statistical learning as an individual ability: Theoretical perspectives and empirical evidence. Journal of Memory and Language, 81, 105–120. https://doi.org/10.1016/j.jml.2015.02.001
