RESEARCH ON LANGUAGE AND SOCIAL INTERACTION, 48(3), 253–270, 2015
Copyright © 2015 IBM Corporation
ISSN: 0835-1813 print / 1532-7973 online
DOI: 10.1080/08351813.2015.1058600

FEATURED DEBATE: AUTOMATED TRANSCRIPTION

Automated Transcription and Conversation Analysis


Robert J. Moore
IBM Research–Almaden, California

I would like to thank Tom Zimmerman for writing the script for converting the timecodes. Correspondence should be sent to Robert J. Moore, IBM Research–Almaden, 650 Harry Road, San Jose, CA 95120. E-mail: rjmoore@us.ibm.com

This article explores the potential of automated transcription technology for use in Conversation
Analysis (CA). First, it applies auto-transcription to a classic CA recording and compares the output
with Gail Jefferson’s original transcript. Second, it applies auto-transcription to more recent record-
ings to demonstrate transcript quality under ideal conditions. And third, it examines the use of auto-
transcripts for navigating big conversational data sets. The article concludes that although standard
automated transcription technology lacks certain critical capabilities and exhibits varying levels of
accuracy, it may still be useful for (a) providing first-pass transcripts, with silences, for further
manual editing; and (b) scaling up data exploration and collection building by providing time-based
indices requiring no manual effort to generate. Data are in American English.

What distinguishes Conversation Analysis (CA) from other fields is its preoccupation not with
language but with talk-in-interaction. This interest in social action means that conversation
analysts examine more than just the words exchanged between people. They also examine how
those words are produced, especially their timing, and a range of nonlinguistic activities such as
laughing, gesturing, artifact use, and more. Conversation analysts are interested in the full set of
resources that people use on particular occasions to organize social interaction locally and
collaboratively. In order to study this domain of social phenomena, conversation analysts first
capture the observable details of naturally occurring talk through mechanical, audiovisual
recording and second represent those details through transcription to aid analysis.
Conversation analysts spend a great deal of time generating transcripts. CA as a field emerged
around the time that consumer compact cassette recorders hit the market in the 1960s. With cassette
recorders, conversation analysts spent hours manually transcribing audio recordings of naturally
occurring talk in their own distinctive style. Then with the emergence of consumer video camcorders
in the 1980s, they branched out into capturing embodied dimensions of talk-in-interaction, such as
gesture and artifact use. This made the job of transcription even more challenging. With the
emergence of digital recording and editing software in the 1990s, the work of CA transcription
became somewhat easier. But despite these technological advances, CA transcription remains a
labor-intensive, manual process, which limits the scalability of the method.
Automatic speech recognition (ASR) offers a means to reduce the amount of labor required
for the transcription of audio recordings of talk. These technologies use statistical methods to
recognize spoken words and transcribe them mechanically. Might ASR tools relieve the con-
versation analyst from at least some of the work of transcription? Would this even be desirable?
Something is lost whenever human analysts do not produce the transcripts they examine. The
process of transcription itself is a powerful way for an analyst to become intimately familiar with
a recording of human activity. If we take Sacks’s (1984) assumption of “order at all points”
seriously, then we as analysts recognize that there are always features captured in the recording
that are not captured in the transcript and that some of these features may be consequential for
the analysis we are putting forward. The process of transcription forces analysts to observe the
fuller set of details from which the transcript is rendered, and consequently they may notice
unexpected features that must be represented.
However, it is also a somewhat common practice in Conversation Analysis to use transcripts
that were generated by someone else, whether by Gail Jefferson or by a research assistant. The
best practice in such cases is always to use the transcript along with the recording. Not only can
the analysts see features that the transcript misses, but they can make enhancements or correc-
tions to it. A transcript in CA is never a finished product. So whether a transcript is generated by
another person or by a machine, analysts should examine it in conjunction with the recording
and edit it as their particular inquiries require.
This article explores the potential of automated transcription for use in Conversation
Analysis. First, a brief overview of automated speech recognition technology is presented.
Second, a particular automated transcription tool will be demonstrated on a classic CA record-
ing, and its capabilities and limitations will be examined. Third, the tool will be demonstrated on
relatively long (11 hours total), high-quality recordings from current events, and the quality of
the resulting transcripts will be examined. And fourth, the potential of auto-transcripts for
navigating big conversational data sets and building collections of phenomena will be explored.

BACKGROUND

Enabling computers to recognize human speech has been a central problem of artificial intelligence for
decades. In the 1970s, the field of automatic speech recognition (ASR) experienced a paradigm shift
when researchers from outside the field introduced a new, controversial approach (see Picheny et al.,
2011). Researchers, who were not speech scientists themselves, began applying statistical methods
from information theory to the problem: “Word error rates [WER] . . . plummeted by a factor of 5—an
unheard of performance improvement in a recognition system” (Picheny et al., 2011, p. 2). Statistical
methods are now widely adopted in the field of ASR and are used in most commercial systems.
In this statistical approach, the speech recognizer consists of four components: a feature extractor,
an acoustic model, a language model, and a decoder (Picheny et al., 2011, pp. 2–3). The feature
extractor transforms the audio signal into discrete values, or mel-frequency cepstral coefficients
(MFCCs), which simulate the frequency resolution of the human ear. The acoustic model provides
probabilities of the occurrence of particular strings of phonemes given particular sequences of
sounds (or MFCC vectors), which are generated from a set of “training” data with labeled phonemes.
The language model provides probabilities of the occurrence of particular strings of words given
particular sequences of phonemes, which are also generated from labeled training sets of words. For
any new acoustic signal containing human speech, the decoder calculates the most likely sequence of
words for each utterance given the probabilities provided by the acoustic and language models.
Because the acoustic and language models are based on finite training sets, they can be further
adapted with additional data to improve performance, or “learn,” over time. However, because the
training sets require manual transcription and labeling, they remain relatively expensive to produce.
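In the standard formulation (a textbook summary rather than a detail reported by Picheny et al., 2011), the decoder therefore searches for the word sequence W that best explains the observed acoustic features A by combining the two models:

W* = argmax over W of P(A | W) × P(W)

where P(A | W) is supplied by the acoustic model and P(W) by the language model.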
Despite breakthroughs in the field, ASR still faces multiple technical problems. “ASR
systems are not robust to noise, channel distortions, and accents. They only model localized
aspects of conversation and sometimes make egregious errors because of this lack of knowledge
of context” (Picheny et al., 2011, p. 2). As we will see, both overlapping talk (Shriberg, 2005)
and background noise (Cooke, 2006) pose fundamental problems for ASR.
Furthermore, even with perfect Word Error Rates (WER) of 0, automated transcripts are still
difficult to read without additional information, especially who spoke when. In order to facilitate
advances in ASR technology, the National Institute of Standards and Technology (NIST) sponsored
a series of evaluations of speech technologies over the past decade on the topic of “rich transcription”
(NIST, 2004, 2009). Central to these challenges has been speaker diarization, which consists of a
segmentation task, detecting speaker changes, and a clustering task, labeling which segments were
produced by the same speaker. Techniques are evaluated with a Diarization Error Rate (DER) metric.
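As a rough summary (the precise scoring rules are specified in the NIST evaluation plans), DER is the proportion of audio time that is wrongly attributed:

DER = 100 × (missed speech time + false alarm time + speaker confusion time) / total scored speech time

so, unlike WER, it is measured over time rather than over words.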
In addition, NIST’s Rich Transcription evaluation specifies several kinds of “edit disfluencies”—
revisions, repetitions, and restarts—as well as so-called “pause fillers,” or discourse markers, that
future ASR systems should be able to detect and label automatically (NIST, 2004, p. 3). To date the
ASR research community has not converged on a single approach to either speaker diarization
(Knox, 2013) or disfluency detection (Liu et al., 2006). Methods for the automated detection of such
speech metadata are not as mature as methods for recognizing the words themselves.
Few studies in CA have dealt with the topic of automated transcription. In one such study, David,
García, Rawls, and Chand (2009) examine ASR technology in the context of medical transcription.
They found that practitioners are oriented to two distinct concerns when editing auto-transcripts of
doctors’ dictations of health records: (a) correcting transcription errors made by the technology; and
(b) correcting medical errors made by the physician, such as dictating the wrong drug. Thus, medical
transcriptionists do much more than simply recognize the physicians’ spoken words: They apply
their knowledge of medicine. In addition, their aim is not to capture what the physician says
verbatim, as a conversation analyst would, but to produce a valid medical record complete with
punctuation and document formatting. Although David et al. (2009) examine medical transcription,
they do not explore automated transcription for the purposes of doing Conversation Analysis itself.

DATA AND METHODS

In this study, an automated transcription server, internal to IBM Research and powered by IBM’s
“Attila” speech recognition engine (Soltau, Saon, & Kingsbury, 2010), was used to generate
transcripts of several sets of audiovisual recordings of naturally occurring talk. Although the
transcription server does not offer all of the capabilities desired, especially speaker segmentation, it
does offer a state-of-the-art speech recognition engine that is also used in leading commercial
packages.
The Attila transcription server provides output in multiple file formats, including the standard
format for subtitling videos (SubRip), which consists of numbered utterances and strings of
words, along with their starting and ending timecodes.
52
00:01:30,004 → 00:01:35,714
oh god long week oh my god i’ve decided sober i want you to have a t. v. i won’t either

This subtitling format, which is intended to be superimposed on a video image, is relatively
difficult to read as a transcript. To improve readability, a simple Python script was created to
convert the timecode information to timed silences following the convention in CA (see Excerpt
1A). In addition, line breaks were inserted automatically after silences of specified lengths: 0.3 s
or greater for telephone calls or 0.5 s for government proceedings. Similarly, NIST (2004, p. 11)
uses 0.5 s as a turn boundary for speaker segmentation tasks: “Although somewhat arbitrary, the
cutoff value of 0.5 seconds has been determined to be a good approximation of the minimum
duration for a pause in speech resulting in an utterance boundary.” Such arbitrary cutoffs make
the raw transcripts much easier to read and can later be corrected manually to reflect actual turn
boundaries. Finally, the automated transcripts were manually corrected with the aid of the
speech-analysis tool, Wavesurfer, specifically its ability to generate spectrograms.
Spectrograms are used in the field of phonetics and provide rich visualizations of many features
of human speech (Ladefoged, 1993; Walker, 2013).
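To make the conversion concrete, the following is a minimal sketch of the kind of script described above. The script actually used is not published here, so the function names, the rounding of gaps to tenths of a second, and the handling of very short gaps as micropauses are illustrative assumptions; only the overall logic, converting SubRip timecodes into timed silences and breaking lines at a 0.3-s or 0.5-s threshold, follows the description in the text.

# A minimal sketch of an SRT-to-CA-silence converter of the kind described
# above. Names, gap rounding, and micropause handling are illustrative
# assumptions; only the overall logic follows the text.
import re

ARROW = re.compile(r"\s*(?:-->|→)\s*")  # SubRip timecode separator

def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:01:30,004 to seconds."""
    h, m, rest = stamp.split(":")
    s, ms = re.split(r"[,.]", rest)
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(path):
    """Yield (start, end, text) for each numbered block in a SubRip file."""
    with open(path, encoding="utf-8") as f:
        for block in f.read().strip().split("\n\n"):
            lines = block.splitlines()
            start, end = (to_seconds(t) for t in ARROW.split(lines[1]))
            yield start, end, " ".join(lines[2:])

def ca_format(captions, break_at=0.5):
    """Interleave recognized text with CA-style timed silences.

    Gaps shorter than `break_at` seconds are shown in-line in parentheses;
    longer gaps are printed on their own line and start a new line of text
    (0.3 s was used for telephone calls, 0.5 s for government proceedings).
    """
    out, current, prev_end = [], "", None
    for start, end, text in captions:
        if prev_end is None:
            current = text
        else:
            gap = start - prev_end
            if gap >= break_at:
                out.extend([current, "(%.1f)" % gap])
                current = text
            elif gap >= 0.1:
                current += " (%.1f) %s" % (gap, text)
            elif gap > 0:
                current += " (.) %s" % text   # micropause
            else:
                current += " " + text
        prev_end = end
    if current:
        out.append(current)
    return "\n".join(out)

# Example: print(ca_format(parse_srt("debate.srt"), break_at=0.5))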
Only public sources of data are used in this article. One source known as the “Newport
Beach” corpus among conversation analysts was recorded in 1968 and transcribed by Gail
Jefferson (source: TalkBank.org). Additional data were obtained from C-SPAN’s online video
library and include video recordings of the 2011 Debt Ceiling Debate in the U.S. House of
Representatives (ID 300758-2-MP3-STD), the 2013 Congressional hearing on Benghazi with
former Secretary of State Hillary Clinton (ID 310496-1-MP4-STD), and the 2013 hearing on the
Dodd-Frank law with Senator Elizabeth Warren (ID 310990-1-MP4-STD). These recordings
were selected for their public availability, recording quality, and length (over 11 hr total), and
they will be used to demonstrate the performance of automated transcription.

Auto-Transcripts and Jeffersonian Transcripts

In this section, an auto-transcript of a classic CA recording, from the Newport Beach corpus
(1968), is compared to Gail Jefferson’s original transcript. The auto-transcript was manually
altered in one way: The turns have been segmented for the purposes of line-by-line comparison.
In the raw auto-transcript, if no timed silences appear between utterances, those utterances
appear as a continuous line, making it difficult to read. In comparing these two transcripts, we
explore the following dimensions of CA transcription: word recognition, speaker segmentation,
overlapping talk, phonetic representation, prosody, and silence.
Excerpt 1 occurs 1 min and 30 s into a telephone call between two friends, Emma and Lottie.
In their casual conversation, they refer to the recent assassination of Robert F. Kennedy.
(Transcription errors in bold.)

1A. [NB-Assassination1:00:01:30:AUTO]
69 oh god long week
70 oh my god
71 i’ve decided sober i want you to have a t. v.
72
73 i won’t either
73.5 (0.7)
74 like uh you know (0.1) that’s where they
75 we took off on our charter flight that same spot
76 did you see it
77 (0.8)
78 and they took him and here uh you
79 know i wouldn’t
80 watch it
81 i think it’s so ridiculous i mean it’s (0.4) it’s a horrible
82 thing but my god (0.1) play up that’s thing it’s it’s (.)
83 horrible
84 die people that
84.5 (0.3)
85 why is it a native american people think well they’re no good
85.5 (0.5)
86 well they aren’t very good some of
1B. [NB-Assassination1:00:01:30:JEFFERSON]
69 Lot: Oh: ↓Go:d a lo:ng wee[k. Yeah.]
70 Emm: [O h : my]↓God
71 I’m (.) glad it’s over I won’t even turn the teevee
72 o[n.
73 Lot: [I won’eether.
74 Emm: °aOh no. They drag it out so° THAT’S WHERE THEY
75 WE TOOK OFF on ar chartered flight that sa:me spot
76 didju see it↗
77 (0.7)
78 Emm: ·hh when they took him in[the airpla:ne,]
79 Lot: [n : N o : : : . ] Hell I wouldn’ ev’n
80 wa:tch it.
81 Lot: I think it’s so ridiculous. I mean it’s ·hhh it’s a hôrrible
82 thing but my: Go:d. play up that thing it it’s jst
83 ↑hôrri[ble. ]
84 Emm: [It’ll] drive people nu:ts.
85 Lot: Why id ï-en makes Americ’n people think why ther no goo:d.
86 Emm: °°Mm:°° Well they aren’t very good some of’m,

The most fundamental task in any transcription is word recognition and representation. In
Excerpt 1, we can see that the automatic transcription technology correctly represents many of
the words. Unlike Jefferson, the tool represents the words in standard orthography. However, the
technology also makes several mistakes. It misses multiple phrases at line 71, getting only “i”
and “t.v.” correct. More word errors occur in bold on lines 74, 78, 79, 84, and 85. We can see
that the system’s errors are not entirely random but often are phonetically similar: For example,
“it’s over” gets misrepresented as “sober” (line 71). Human transcriptionists of course also make
such recognition errors but are no doubt better able to use knowledge of the context of the
surrounding talk to disambiguate alternative hearings.
In the ASR literature, word error rate (WER) is routinely used to summarize and evaluate a
tool’s word recognition performance. WER captures three types of transcription errors and is
calculated as follows:

Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / Total Words
From the 107 words in Excerpt 1, the system produced five insertions, 21 substitutions, and 12
deletions, resulting in a WER of 36%. This is a fairly high score. The poor performance in this
case is no doubt due in part to the fact that overlapping talk was included in the scoring and that
the quality of the recording, captured in 1968, is relatively low. While WER provides a high-
level summary of performance, the aim here is to explore the local circumstances under which
the tool performs well and poorly.
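For readers who wish to reproduce such counts, insertions, substitutions, and deletions come from a minimum-edit-distance alignment between a reference transcript and the ASR output. The sketch below implements the generic textbook dynamic-programming computation; it is not the scoring tool used by IBM or NIST, which additionally normalize contractions, capitalization, and the like.

# Generic WER sketch: edit distance between word sequences, expressed as a
# percentage of the number of reference words. Illustrative only.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution or match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# e.g., word_error_rate("we had four dead people", "we had for dead people") -> 20.0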
Another fundamental feature of transcription for Conversation Analysis is speaker segmenta-
tion, or labeling who spoke when. Identifying when speaker change occurs and when a previous
speaker speaks again is essential—for example, for analyzing turn taking (Sacks, Schegloff, &
Jefferson, 1974). As noted, automated techniques for such speaker segmentation, or diarization,
are less mature than word recognition techniques, and consequently the ASR community has yet
to converge on a single approach. Unfortunately, the auto-transcription tool used in this study
lacks speaker segmentation capability; therefore, the analyst cannot easily determine turn-taking
events from the raw output alone. For example, at the beginning of Excerpt 1A, lines 69–73 all
appear as a single line in the raw output, even though they include two speaker changes (lines 70
and 73). Similarly, lines 78–81 appear as one line in the raw output, since they contain no timed
silences, even though they contain one speaker change (line 79). Information about which
speaker produced each word must be added manually.
In naturally occurring speech, overlapping talk routinely occurs, especially near transition
relevance places (Sacks et al., 1974), and in CA these overlaps are marked with square brackets
around the talk of each speaker. However, since the auto-transcription tool cannot distinguish
different voices, it processes all overlapping voices as a single voice. This causes word
recognition errors (Shriberg, 2005). For example, in the auto-transcript, “the airplane” (line
78) overlaps a stretched “no” (line 79), and both utterances are rendered as a single utterance:
“and here uh you know.” Again we can see the tool picking up on elements of the talk
phonetically, “know” instead of “no,” but the result is unanalyzable. Similarly on lines 69–70,
72–73, and 83–84, words in the auto-transcript are missed due to overlapping speech.
One of the more distinctive characteristics of Jefferson’s transcription style is the phonetic
representation of some words. She often uses nonstandard spellings of words in order to capture
the “pronunciational particulars” (Jefferson, 1983) of the talk. Jefferson offers something in
between standard orthography and phonetic transcription. She argues that capturing such
phonetic detail is important “because it’s there” and because sometimes it can be shown to
influence the trajectory of the interaction. The auto-transcription tool, in contrast, tends to use
standard orthography, as we can see in Excerpt 1A. However, the tool can also output common
contractions like wanna, gotcha, kinda, and gonna, as well as tokens including uhhuh, uh, and
um. Including nonstandard representations in the tool’s training data enables it to produce them.
Capturing the prosodic features of talk also distinguishes CA transcripts from more standard
forms of transcription. These features include stress, pitch, loudness, tempo, and elongation of
words or syllables (Jefferson, 2004). Emphatic stress is indicated by underscoring, pitch changes
by punctuation or arrows, louder talk by caps, quieter talk by degree symbols, faster talk by
greater-than and less-than symbols, and stretch by colons (see Excerpt 1B for instances of each).
Prosodic features of talk are represented in CA because they have been shown to be conse-
quential; that is, they are used by speakers to accomplish various actions such as turn transition,
affiliation/disaffiliation, “astonishment,” and much more (Couper-Kuhlen & Selting, 1996).
Like speaker segmentation, automated methods for detecting “prosodic cues” are less mature
than those for word recognition and remain topics for NIST’s Rich Transcription evaluation
series (NIST, 2009). Research in this area, for example, attempts to use prosodic features, such
as changes in pitch, to help detect utterance boundaries and determine appropriate punctuation
(period, comma, question mark) for more standard transcripts (Liu et al., 2006). Unfortunately,
the auto-transcription tool used here does not include any capabilities for detecting and repre-
senting prosody.
Despite the multiple features of talk that the auto-transcription tool does not capture, the
technology can represent silences between words, which are of major import for CA. Perhaps
one of the most distinctive features of Jeffersonian transcription (Jefferson, 2004) is representa-
tion of the silences between words and phrases, measured in tenths of seconds. Capturing
silences enables conversation analysts to observe gaps, pauses, and lapses between turns and
to examine phenomena such as turn taking (Sacks et al., 1974), preference structure (Pomerantz,
1984), and much more.
In practice there are alternative methods used for measuring silences in Conversation
Analysis. The original counting method involves “measuring silence relative to speech rhythm”
(Hepburn & Bolden, 2013, p. 60); therefore, according to this method, the numbers in parenth-
eses do not necessarily represent seconds. In addition, silences of less than 0.2 s are considered a
“normal” length of time between turns, rather than a delay, and are often omitted from the
transcript (Hepburn & Bolden, 2013, p. 61). However, later methods employed stopwatches and
eventually computer-assisted measurement involving visualizations such as waveforms and
spectrograms. These methods rely on absolute measures of time, but they are still captured
manually, and so they can vary from transcriptionist to transcriptionist (Roberts & Robinson,
2004). Using these semimechanical methods, transcriptionists must decide whether or not to
include short silences of less than 0.2 s between or within turns. Automated tools include all
detected silences in the transcript, so they tend to represent more such short silences than human
transcriptionists, but they likely enable better consistency.
As noted, the subtitling format (SubRip) represents the onset and offset times for each
utterance. This time information can easily be converted to the CA format of timed silences,
making the magnitude of silences more visible at a glance than raw timecodes. In the auto-
transcript (1A), we can see timed silences at lines 73.5, 74, 77, 81, 82, 84.5, and 85.5. At first
glance, these silences in the auto-transcript do not appear particularly accurate. At lines 73.5 and
74, it produces silences that do not appear in Jefferson’s transcript. However, we can also see
that these two silences appear on either side of an incorrectly transcribed segment, “like uh you
know” (line 74). The system’s ability to recognize the actual phrase, “aOh no. They drag it out
so” (line 74), is no doubt compromised by the fact that it is softly spoken, as indicated by the
degree symbols, and by the contracted form at the beginning. In addition, adjacent to another
word recognition error, “that” instead of “nu:ts” (line 84), the auto-transcript indicates a 0.3-s
silence (line 84.5) where Jefferson does not. The system also sometimes inserts a silence when it
misses a word entirely. For example, it misses the “jst” (line 82) and instead inserts a micro-
pause. The lesson here seems to be that if silences occur adjacently to word recognition errors,
their length may be wrong.
Differences in the representation of silences between the auto-transcript and the Jefferson
transcript also occur in the vicinity of audible breaths. The automated transcription tool, with its
existing training data, ignores all audible breath sounds. So for example, we see a slightly
longer silence that is adjacent to an audible in-breath (lines 77–78) and a 0.4-s silence in place of
a “.hhh” (line 81). In fact, in the copy of the recording used in this study, there is no evidence of
these in-breaths.
Audible breath sounds are identifiable on spectrograms. For example, Figure 1 offers a
spectrogram containing an audible in-breath. In it, a breath sound (.h) is clearly visible as
mid- to high-frequency frication or noise (box). The in-breath is 0.23 s long and is followed by a
silence of 0.17 s. If we make each “h” equal to approximately 0.2 s and round the length of
silence to the nearest 10th, we get a transcription of “.h (0.2)” occurring in between “kill some
americans” and “what.” Contrast this with the spectrogram from Excerpt 1 (lines 76–78) in
Figure 2. It shows no evidence of an audible breath between the words “it” (ending 1:41.4) and
“when” (beginning 1:42.1). Instead it shows only a 0.84-s silence. Nor is there audible evidence
when the recording is played back. The same occurs in Figure 3 from Excerpt 1 (lines 81–82).
Although Jefferson transcribes a silence as “.hhh” (line 81), there is no visual evidence of the
in-breath on the spectrogram (Figure 3) between “it’s” (ending 1:47.6) and “it’s” (beginning
1:47.9), nor auditory evidence when played back (similarly, the very quiet "Mm:" at line 86 is neither audible nor visible).

FIGURE 1 In-breath on spectrogram. (All spectrograms have a frequency range of 0–5000 Hz on the vertical axis.)

FIGURE 2 No in-breath on spectrogram.

FIGURE 3 No in-breath on spectrogram.

These could be errors in Jefferson's
transcript, or they could be the result of the low-quality recording used here. In the spectro-
gram we can see, and in the audio playback hear, quiet, mid- to high-frequency background
noise throughout the entire recording. This noise could be masking an in-breath that Jefferson
could have heard in a cleaner copy of the recording. This is supported by the fact that the
sibilant sounds at the end of both occurrences of “it’s” in Figure 3 are likewise not visible or
audible. Either way, the tool does the right thing by representing silence in these cases. As we
will see in the next section, when word recognition is high, the accuracy of the timed silences
in the auto-transcript tends to be quite high, with the caveat that breath sounds are entirely
ignored.
We see then that of the six dimensions of Jeffersonian transcription discussed—word
recognition, speaker segmentation, overlapping talk, phonetic representation, prosody, and
silence—the auto-transcription tool offers only the first and the last. In addition to demonstrating
what auto-transcription can and cannot do, this section has attempted to explore some of the
local circumstances that can cause trouble for auto-transcription, such as overlaps, quiet talk, and
background noise.

Auto-Transcript Quality

While the prior section compared the format of auto-transcripts with Jeffersonian conventions,
this section examines the quality of auto-transcripts when generated from more suitable record-
ings than the 1968 Newport Beach calls. These include three C-SPAN recordings—Debt Ceiling
Debate, Benghazi Hearing, and Dodd-Frank Hearing—that together last over 11 hr and from
which over 17,000 lines of auto-transcript were generated with no manual effort. While the
quality of word recognition in the previous section was rather low (WER 36%), the C-SPAN
recordings result in significantly higher quality overall.
Excerpt 2 comes from the Debt Ceiling Debate. Version A is the output of the auto-
transcription tool. Errors in word recognition and silence length are highlighted in boldface.
Version B is a manually corrected version of A. Speaker segmentation, emphatic stress, pitch
changes, and sound stretches have also been added, although not in as much detail as Jefferson’s
transcripts, and audible breath sounds have been intentionally ignored.

2A. [DebtCeiling:03:02:05:AUTO:WER:2%]
01 last november we all know that (0.1) there was an overwhelming
02 message that was sent by the american people (0.2) to washington
03 d. c. (0.4) and that message was number one (1.0) create jobs
04 (.) get our economy back on track (.) and in so doing (0.5)
05 rain in (0.6) the dramatic increase (0.2) in the size (.) and
06 scope (.) and reach of government that we witnessed in the past
07 several years
08 (0.5)
09 we all know in the last four years we’ve had (0.3) an eighty
10 two percent increase (.) in non defense (0.4) discretionary
11 spending an eighty two percent increase
12 (0.6)
13 and so the message that was sent (.) was (0.3) that has to come
14 to an end.
15 (0.9)
2B. [DebtCeiling:03:02:05:CORRECTED]
01 DD: last november we all kno:w that (0.2) there was an overwhelming
02 message that was sent by the american people, (0.3) to washington
03 dee cee. (0.5) and that message was number one:, (1.0) create jo:bs
04 (.) get our economy back on track, (.) and in so doi:ng, (0.7)
05 r:ein in:, (0.7) the dramatic increas:e (0.3) in the si:ze (.) and
06 sco:pe (.) and re:ach of government that we’ve witnessed in the past
07 several year:s.
08 (0.7)
09 we all know in the last fo:ur years we’ve had uh:- (.) an eighty
10 two: percent increase (.) in no:n defense: (0.4) discretionary
11 spending. an eighty two percent increase.
12 (0.7)
13 and so the message that was sent (.) wa:s (0.3) that has to come
14 to an end.
15 (1.0)

We can see in this excerpt that the quality of the auto-transcript is very high (WER 2%). In terms
of word recognition, there are only two very minor errors, “rain” instead of “rein” (line 05) and
“we” instead of “we’ve” (line 06). In terms of timed silences, most are only 0.1 s short compared
to manual timing with a spectrogram.
The high quality in the auto-transcript is due to a few factors. First, the training data from
which the statistical models were derived were “broadcast news” recordings, which are similar in
form to these Congressional debates and hearings. Second, speakers in Congressional debates
tend to talk clearly and relatively slowly. This makes it easier for automated transcription, as well
as for humans, to recognize the words. Third, Congressional debates consist primarily of a series
of monologues. The participants use a formal turn-taking procedure, different from that of
ordinary conversation (Sacks et al., 1974), which involves fixed-length turns and minimizes
interruptions. Within these extended turns, there tend to be very few overlaps. And fourth, the
quality of the recordings themselves is very high.
Although not as accurate as that of the Congressional debate, auto-transcripts of the
Congressional hearings on Benghazi and Dodd-Frank are also quite high in quality. While
such hearings also contain long monologues, they contain more question-answer sequences
during which overlaps occur, which cause more errors. Take the following memorable exchange,
from the 2013 Benghazi Hearing, between Senator Ron Johnson and former Secretary of State
Hillary Clinton:

3A. [Benghazi:01:23:32:AUTO:WER:29%]
14 but (0.1) let me know (.)
15 when people (0.2) don’t know
16 within days (.)
17
18 and i mean (0.1) you know (.)
19 with all due respect (0.1) the fact is we have four dead
20 (.) americans what (0.1) kind of a protester was a
21
22 because of guys out for a walk when i decide
23 they go kill some americans
24 (0.4)
25 what (0.1) difference at (.) this point does it make
26 (0.2) it is (0.1) our job (0.1) to figure out what
27 happened and do (.) everything we can (0.1)
28 to prevent it from ever happening again senator now
29 (0.5)
3B. [Benghazi:01:23:32:CORRECTED]
14 HC: =but (0.1) but [ you know ]
15 RJ: [>and the ameri]can people could’ve< known
16 that within days,=
17 HC: =and=
18 RJ: =and they would (.) they didn’t know that.=
19 HC: =with all due respect. (0.2) the fact i:s we have four dead
20 (.) americans [was it because of a] (.) protest or was it=
21 RJ: [ i understand ]
22 HC: =because of guys out for a walk one night who decided
23 they’d go kill some americans.
24 (0.4)
25 HC: what (0.2) difference at (.) this point does it make.
26 (0.2) it is (0.1) our job (0.2) to f:igure out what
27 happened and do (0.1) everything we can (0.2)
28 to prevent it from ever happening again senator. now
29 (0.6)

In Excerpt 3A, we can see multiple word recognition errors (bold) caused by overlapping talk
(lines 14–15, 20–21). Errors also occur where there are tightly interspersed turns (lines 16–19).
However, where there is extended talk by a single speaker, there are many fewer errors (lines
22–29). This results in an overall WER of 29% for this excerpt. WER may at times overestimate
errors. For example, the minor error, “protester” instead of “protest or” (line 20) counts as two
errors: a substitution and a deletion, even though it is very close to correct. In addition, the timed
silences that are not adjacent to word errors are highly accurate and tend to be only 0.1 s shorter
than those measured with the manual spectrogram method.
Comparable levels of accuracy can also be found in the Dodd-Frank Hearing. In the
following exchange, Senator Elizabeth Warren grills federal bank regulator, Thomas Curry:

4A. [Dodd-Frank:01:37:40:AUTO:WER:19%]
01 i’m (0.2) sorry to interrupt i just wonder this long
02 (.) it’s effective police settlement (0.2) and what i’m
03 asking is (0.4) when did you last (.) may (0.3)
04 you haven’t been there forever so i’m really asking that
05 the s. e. c. (0.4) a large financial (.)
06 institutions (.) a wall street bank (0.2) both to trial
07 (.)
08 the institutions i supervise national banks of federal
09 troops we’ve actually had a (0.1) fairly (0.1) a fair
10 number of (0.3) consent orders
4B. [Dodd-Frank:01:37:40:CORRECTED]
01 EW: i’m sorry to interrupt but i just wan’ move this alo:ng
02 (.) that’s effectively a settlement. (0.4) and what i’m
03 a:sking i:s (0.4) when did you la:st ta:ke- (.) an’ I know
04 you haven’t been there forever so i’m really asking about
05 thee oh cee cee:. (0.4) a large financial (.)
06 insti[t ]ution,(.) a wall street bank,(0.2) [ to t ]rial.
07 TC: [we] [well uh]
08 (.)
09 TC: thee institutions i supervise national banks and federal
10 thrifts we’ve actually had uh: (0.1) fairly (0.1) a: fair
11 number of uh (.) consent orders.

In Excerpt 4 we see about the same number of word recognition errors as in the previous one with a
lower WER of 19%. Only one error occurs at an overlap (lines 06–07), and one timing error occurs
adjacent to a word recognition error (line 03). We can also see that the system misrepresents a
technical banking term, “federal thrift” (lines 8–9), as a more common term, “federal troops.”
We see then in this section that the accuracy of word recognition and timed silences in auto-
transcripts can be quite high. The raw output (A) on its own is very readable. In addition, the
amount of work involved in correcting and enhancing it (B) is considerably less than that
required to transcribe the whole excerpt from scratch.

Auto-Transcripts and Collection Building

When the accuracy of auto-transcripts is adequate, as in the three recordings discussed in the
previous section, it is possible for the analyst to use them for data navigation, exploration, and
collection building. One of the fundamental methods in Conversation Analysis is working with
“collections” (Schegloff, 1996). Analysts identify a pattern in the data that they believe might
constitute a recurrent phenomenon and then search the data set and others for more instances of
it. In addition to exemplars, or “clean cases,” of the phenomenon, analysts assemble collections
“generously,” including instances that appear different from the core collections. “This will
allow us—indeed force us later on, when we discard these instances, to make explicit just what it
is which makes them different from our targets” (Schegloff, 1996, pp. 176–177). Working with
collections enables analysts to identify the range of trajectories a particular conversational
practice can take on different occasions and under different circumstances. From collections,
analysts abstract general accounts of how the practice works.
When the timecodes are transformed into CA-style silences, and silences of less than 0.5 s are
partitioned to the same line, the resulting auto-transcript is quite readable and explorable, even
though it contains errors. In addition, the subtitle files (SubRip; see previous), with their
standard timecode format, provide a searchable index for the recording. Using both together,
the analyst can browse the auto-transcript for potentially interesting segments and then use the
auto-index to pinpoint exactly where those segments occur in the recording. The result is an
iterative exploration process in which the analyst moves between auto-transcript, auto-index, and
recording. (Embedding the timecodes in the auto-transcript file so that they pop up when an utterance is moused over would make the process even easier.) Some video players (e.g., VLC) provide a "Go To Time" function with which the
analyst can enter the timecode from the index file and skip straight to the segment of interest in
the video file. In addition, the subtitle file can be imported into some players and the auto-
transcript displayed on the picture as subtitles, making it easy to view the transcript and video at
the same time.
There are of course many different ways to explore a transcript of talk-in-interaction. One
general approach is to inspect it in an “unmotivated” manner, to see what is there, rather than
looking for a particular thing (Sacks, 1984). But once a phenomenon begins to emerge from
this unmotivated looking, the analyst then generally begins to hunt for more cases of it. Auto-
transcripts may be especially useful for this motivated hunting. When interactional phenom-
ena are rare, the more data in which one can search for them, the better. Conversation
analysts have used many of the same data sets over the years: for example, the Newport
Beach, Two Girls, Heritage, Holt and Rahman corpora, and more. When a generic phenom-
enon is discovered in one data set, these shared CA corpora are often searched for additional
cases. Auto-transcription could greatly expand this territory, in which analysts hunt for new
cases.
The automated tool produced over 17,000 new lines of transcript from the three C-SPAN
recordings with no manual effort. Although they do not provide as complete and correct a
transcript as the shared CA corpora, they may be good enough for discovering new cases of
some phenomena if they are used in conjunction with the recording and index files. To
demonstrate this possibility, these auto-transcripts were explored for new cases of two known
phenomena: well-prefaced answers and change-of-state tokens.
One well-known phenomenon in the CA literature is the use of well-prefacing as part of
“dispreferred” turn formats (Pomerantz, 1984). A search for the word well across the 17,000
lines of auto-transcript returns 122 occurrences of the token well. But since the auto-transcripts
include silence information, our search for well-prefaces can be narrowed by including a silence,
as represented by a closed parenthesis, before the token in order to locate wells that are more
likely to occur at the beginning of a turn. This addition narrows the results to 99. A manual
inspection of these results reveals that most cases of well following a short silence (<0.5 s) occur
within turns in the Debt Ceiling data, although not in the other two data sets. Cases of well
following a longer silence (≥0.5 s), on the other hand, tend to occur between turns even for the
Debt Ceiling data. Therefore, limiting our search to “wells following longer silences” will further
narrow our search from 99 to 48 places in the recordings where we might find well-prefacing.
Although a complete analysis of the remaining 48 candidate cases is beyond the scope of this
article, the point is simply that auto-transcripts can provide a method, albeit an imperfect one, for
locating candidate cases for collections.
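As an illustration of how such a query might be expressed (the search reported above was conducted by hand over the converted transcripts; the pattern and names below are assumptions for demonstration), a silence-then-well search over a CA-formatted auto-transcript can be written as a simple regular expression with the silence threshold as a parameter:

# Illustrative only: find "well" tokens preceded by a timed silence of at
# least `min_gap` seconds in a CA-formatted auto-transcript string.
import re

SILENCE_THEN_WELL = re.compile(r"\((\d+\.\d)\)\s+well\b", re.IGNORECASE)

def well_preface_candidates(transcript, min_gap=0.5):
    """Return (gap, character offset) pairs; offsets can then be mapped back
    to timecodes via the SubRip index file to jump to the recording."""
    return [(float(m.group(1)), m.start())
            for m in SILENCE_THEN_WELL.finditer(transcript)
            if float(m.group(1)) >= min_gap]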
Excerpt 5 provides one of the cases of well-prefacing discovered in the Benghazi data.

5A. [Benghazi:00:56:36:AUTO:WER:2%]
01 (0.5) i told the american people that quote (0.2)
02 heavily armed militants assault on the (.) compound (.)
03 in (0.1) and (0.2) vowed (.) to bring them to justice
04 (2.2) i’m assuming that you had rock
05 solid evidence to (0.6) to make such a bold statement
06 (0.1) at that (.) time
07 (0.8)
08 → well we had four dead people (0.5) and we had (0.2)
09 several (0.1) injured one seriously who still in
10 walter reed . . .
11 ((5 lines omitted))
12 . . . so we knew that (0.3) clearly (0.4) there was
13 (.) a (0.5) an attack (0.3)
5B. [Benghazi:00:56:36:CORRECTED]
01 JR: (0.5) i told the american people that quote (0.3)
02 heavily armed militants assaulted our (.) compound (0.1)
03 in- and (0.2) vowed (.) to bring them to justice.
04 (0.6) uh::m: (0.8) ah- i’m assuming that you had rock
05 solid evidence to: uh: eh- to make such a bold statement
06 uh at that (.) time.
07 (0.8)
08 HC: → well we had four dead people, (0.5) and we ha:d (0.2)
09 several: (0.2) injured one seriously who’s still in
10 walter ree:d, . . .
11 ((5 lines omitted))
12 . . . so we knew that (0.4) clearly (0.5) there wa:s
13 uh: (0.5) an attack? (0.3)

We can see that the auto-transcript is very accurate and readable. Only one minor word
recognition error occurs, “assault on the compound” instead of “assaulted our compound” (line
2): indeed an error that even an experienced transcriptionist might make on a first pass.
In this case, JR’s turn (lines 01–06) projects a confirmation as the preferred next action. But
instead of producing one, HC pauses and begins a reporting of the circumstances, “we had four
dead people . . . ” (lines 08–10). She then draws the upshot of the reporting as, “so we knew that
clearly there was an attack” (lines 12–13). By using “reporting” (Drew, 1984), HC avoids
aligning with loaded characterizations in the question: “rock solid evidence” and “such a bold
statement” (lines 04–05). The well-prefacing of the reporting is a feature of the response’s
“dispreferred” turn format (Pomerantz, 1984).
Another known phenomenon in the CA literature is the use of oh “to propose that its producer has
undergone some kind of change in his or her locally current state of knowledge, information,
orientation or awareness” (Heritage, 1984, p. 299). A search for oh generates many fewer results
than the one for well: only 12 instances in the three C-SPAN auto-transcripts. A quick manual
inspection of these cases reveals that 10 of them are the use of oh as the number zero. The other two
are instances of oh-prefacing: one “oh yeah” and the other simply “oh,” as we can see in Excerpt 6.

6A. [DebtCeiling:03:48:26:AUTO:WER:56%]
01 i just like it is yes jim what he might say
02 where that is from the labeling that literature
03
04 speaker
05 (1.0)
06 last (.) august
07 (0.3)
08 → oh i’m so i thought it yielded
09 (0.4)
6B. [DebtCeiling:03:48:26:CORRECTED]
01 DD: i’d just like to a’the g- ask the gentleman if he might cite,
02 (0.2) uh:m wh-where that is from: [the-thee quote of that?]
03 BL: [i eh mi- mister mi]ster
04 speaker uhm,
05 (0.6) ((inaudible voice of 3rd person))
06 BL: can I ask for s-
07 (0.3)
08 DD: → oh i’m sorry i thought he’d yielded.
09 BL: huh

In this excerpt, we can see a breach of turn-taking protocol on the floor of the U.S. House of
Representatives. BL, a Democrat, had just characterized the Republican bill under debate as a
“reckless plan” and began to compare it to the “Ryan plan” (not shown), when DD, a
Republican, interrupts his turn and allotted time by asking BL to cite where something he just
mentioned “is from” (lines 01–02). BL addresses the acting House Speaker (lines 03–04), no
doubt beginning to appeal to him to intervene. A third person, most likely the House Speaker,
can barely be heard in the background (line 05). In response, BL begins a request (line 06), but
before he finishes, DD produces an oh-prefaced utterance, “oh I’m sorry I thought he yielded”
(line 08), which apologizes and offers an account for his speaking out of turn. The oh proposes a
“change-of-state” or realization on the part of DD. BL then chuckles (line 09) and proceeds to
request more time with the justification that “I don’t believe that I did yield” (not shown).
In Excerpt 6, there are several errors and omissions in the auto-transcript. First, the disfluency
and restart (line 01) cause transcription errors. Second, the overlap (lines 02–03) results in
additional errors. Third, the cut-off utterance, “can I ask for s-” (line 06) is completely
mistranscribed. Fourth, the words “sorry” and “he” in the key utterance (line 08) are also in
error. That being said, the poor auto-transcript of this excerpt, with a large WER of 56%, is
nonetheless adequate for enabling the discovery of another case of a change-of-state token used
as a turn preface (Heritage, 1984). With a quick listening of the audio and a little correcting of
the auto-transcript, this case was found in a haystack of 11 hours of talk-in-interaction with no
preexisting manual transcript.
Of course simple word searches have limitations as a method for identifying interactional
phenomena. This is true whether the analyst is using auto-transcripts or classic CA corpora.
Many phenomena do not have a single identifiable word or phrase associated with them. While
an easy case was chosen here, i.e., phenomena that contain distinctive tokens, well and oh, other
types of phenomena, such as sequence organization or repair, may not be so easily discoverable.
Similarly, instances in which a particular action is absent cannot of course be found by searching
for the action itself. In such cases, analysts must hunt for a phenomenon in more sophisticated
ways—for example, by searching for other words that tend to occur in the vicinity of the
phenomenon of interest. How widely applicable auto-transcription may be for locating known
interactional phenomena and discovering new ones is an open question. However, as ASR
researchers develop better techniques for generating “rich transcriptions” (NIST, 2004, p. 3),
including the automatic detection of “edit disfluencies”—revisions, repetitions, and restarts—as
well as “pause fillers,” or discourse markers, prosodic cues, and more, tools should become more
and more suitable over time for the analysis of interaction.

DISCUSSION

The comparisons detailed here attempt to demonstrate both the limitations and the potential
value of automated transcription for Conversation Analysis. It is clear that the auto-transcripts
used in this analysis are inadequate as final representations. This is not surprising, since the
output format was designed for subtitling videos and not for Jeffersonian transcription. However, auto-
transcripts may be adequate as rough, first-pass transcripts that can reduce the transcriptionist’s
work in terms of capturing words and timing silences. As we saw with Excerpts 2 and 5, the
word error rates can be as low as 2%. As a result, correcting word and timing errors and adding
prosody marks, overlaps, and other CA conventions involves much less work than starting
entirely from scratch.
But perhaps more importantly, automated transcription can provide analysts with a partial
transcript when there is nothing else. Using an auto-transcript to discover excerpts for further
scrutiny involves a tiny fraction of the work that would be required to transcribe an entire corpus
manually before exploring it. Consequently, auto-transcription could enable conversation ana-
lysts to increase the scalability of their method. In an era of “big data,” in which the volume of
digital data, including audiovisual recordings, is growing at a rapid rate, auto-transcription could
be one tool in helping conversation analysts to increase their reach across this growing sea of
data. Auto-transcription could greatly expand their universe of transcribed talk well beyond that
of Emma and Lottie or Ava and Bee.
One major limitation of automated transcription for CA involves average recording quality.
Although Congressional debates and hearings are instances of naturally occurring institutional
talk, they are somewhat unusual in that they are professionally instrumented for audiovisual
recording. As a result, the quality of the recordings is very high. The same cannot be said for
many recordings used in CA that are captured by researchers themselves. Such recordings are
often “messier” than C-SPAN recordings in that speakers may be far from a microphone, and
environmental noises, static, or even voices from other conversations may be audible in the
background. These and other features make it difficult for professional human transcriptionists to
sift out and capture the talk of interest and pose major challenges for the use of auto-transcription
(Cooke, 2006; Shriberg, 2005).
Another limitation, as mentioned, is that the quality of the auto-transcript is dependent on the
similarity of the labeled training data and the target recording. As a result, the acoustic and
language models trained on “Broadcast News” or “Telephone Conversations” do not work well
for other domains, such as doctor-patient interactions, in which both the vocabulary and the style
of talk are different. However, this limitation is also an opportunity. While toolmakers lack the
labeled data with which to train their systems for many domains, conversation analysts and other
speech researchers have accumulated massive sets of such recordings, along with detailed
manual transcripts. These materials could perhaps be used as the basis for training automated
transcription systems in new domains.

CONCLUSION

Like compact cassette recorders and sound-editing software, automatic speech recognition tools
have the potential to advance the field of Conversation Analysis by reducing the work of manual
transcription. The technology has made great strides and is useful today for naturally occurring
talk in some domains and under certain conditions. As ASR researchers continue to develop
techniques for more robust and richer automated transcription, the day may be nearing when
reliable speech-to-text will be a standard capability of most computing devices and available to
all. That day has not yet come, but the preceding examination has attempted to demonstrate that
the technology has come far enough that it can no longer be dismissed as purely science fiction.
How useful automated transcription may ultimately become for interaction research is still
unclear, yet the time is ripe for conversation analysts to begin experimenting with the technol-
ogy. Full automation of transcription will likely never be desirable for the field of CA; however,
using partial automation in order to increase the scalability of conversation analytic methods
seems more promising. Automated transcription could thus help conversation analysts better
exploit the growing volume of big conversational data.

REFERENCES

Cooke, M. (2006). A glimpsing model of speech perception in noise. The Journal of the Acoustical Society of America, 119, 1562–1573.
Couper-Kuhlen, E., & Selting, M. (Eds.). (1996). Prosody and conversation. Cambridge, England: Cambridge University Press.
David, G., García, A. C., Rawls, A., & Chand, D. (2009). Listening to what is said—transcribing what is heard: The
impact of speech recognition technology (SRT) on the practice of medical transcription (MT). Sociology of Health &
Illness, 31, 924–938. doi:10.1111/shil.2009.31.issue-6
Drew, P. (1984). Speakers’ reportings in invitation sequences. In J. M. Atkinson & J. Heritage (Eds.), Structures of social
action (pp. 129–151). Cambridge, England: Cambridge University Press.
Hepburn, A., & Bolden, G. (2013). The conversation analytic approach to transcription. In J. Sidnell & T. Stivers (Eds.),
The handbook of conversation analysis (pp. 57–76). Oxford, England: Wiley-Blackwell.
Heritage, J. (1984). A change of state token and aspects of its sequential placement. In J. Maxwell Atkinson & J. Heritage
(Eds.), Structures of social action (pp. 299–345). Cambridge, England: Cambridge University Press.
Jefferson, G. (1983). Issues in the transcription of naturally-occurring talk: Caricature versus capturing pronunciational
particulars. Tilburg Papers in Language and Literature (No. 34). Tilburg University, Tilburg, The Netherlands.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In G. H. Lerner (Ed.), Conversation analysis:
Studies from the first generation (pp. 13–31). Amsterdam, The Netherlands and Philadelphia, PA: John Benjamins.
Knox, M. T. (2013). Speaker diarization: Current limitations and new directions (Unpublished doctoral dissertation).
EECS Department, University of California, Berkeley, CA.
Ladefoged, P. (1993). A course in phonetics. Orlando, FL: Harcourt Brace.
Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., & Harper, M. (2006). Enriching speech
recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech,
and Language Processing, 14(5), 1–15.
National Institute of Standards and Technology (NIST). (2004). Fall 2004 Rich Transcription (RT-04F) Evaluation Plan,
pp. 1–27. Retrieved from http://www.itl.nist.gov/iad/mig/tests/rt/2004-fall/docs/rt04f-eval-plan-v14.pdf
National Institute of Standards and Technology (NIST). (2009). The 2009 (RT-09) Rich Transcription Meeting
Recognition Evaluation Plan, pp. 1–18. Retrieved from http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meet
ing-eval-plan-v2.pdf
Picheny, M., Nahamoo, D., Goel, V., Kingsbury, B., Ramabhadran, B., Rennie, S. J., & Saon, G. (2011). Trends and
advances in speech recognition. IBM Journal of Research and Development, 55(5), 2:1–2:18. doi:10.1147/
JRD.2011.2163277
Pomerantz, A. (1984). Agreeing and disagreeing with assessments: Some features of preferred/dispreferred turn shapes. In J. M.
Atkinson & J. Heritage (Eds.), Structures of social action (pp. 57–101). Cambridge, England: Cambridge University Press.
Roberts, F., & Robinson, J. D. (2004). Inter-observer agreement on “first-stage” conversation analytic transcriptions.
Human Communication Research, 30, 376–410. doi:10.1111/j.1468-2958.2004.tb00737.x
Sacks, H. (1984). Notes on methodology. In J. M. Atkinson & J. Heritage (Eds.), Structures of social action (pp. 21–27).
Cambridge, England: Cambridge University Press.
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for
conversation. Language, 50, 696–735. doi:10.1353/lan.1974.0010
Schegloff, E. A. (1996). Confirming allusions: Toward an empirical account of action. American Journal of Sociology,
102, 161–216.
Shriberg, E. E. (2005, September). Spontaneous speech: How people really talk, and why engineers should care. Paper
presented at Interspeech’2005 – Eurospeech, 9th European conference on speech communication and technology,
Lisbon, Portugal. Abstract retrieved from http://www.isca-speech.org/archive/interspeech_2005/i05_1781.html
Soltau, H., Saon, G., & Kingsbury, B. (2010). The IBM Attila speech recognition toolkit. In Spoken Language
Technology Workshop (SLT), 2010 IEEE (pp. 97–102). doi: 10.1109/SLT.2010.5700829
Walker, G. (2013). Phonetics and prosody in conversation. In J. Sidnell & T. Stivers (Eds.), Handbook of conversation
analysis (pp. 455–474). Oxford, England: Wiley-Blackwell.
