
Transcribe Long-Form Transcription Guidelines


Version: 3.0
Release Date: 20191204

· 1. Introduction
· 2. Segmentation
· 2.1. Creating Segments
· 2.1.1. General Segmentation Requirements
· 2.1.2. Specific Requirements for Each Segment Type
· 2.1.2.1. Speech
· 2.1.2.2. Babble
· 2.1.2.3. Overlap
· 2.1.2.4. Music
· 2.1.2.5. Noise
· 2.2. Segmentation Examples
· 2.2.1. Example 1 - Segmenting an Audio File with Split-Channel Conversation Telephony
· 2.2.2. Example 2 - Segmenting a Co-Channel Media File
· 2.3. Labelling Segments
· 2.3.1. All Segments
· 2.3.2. Speech Segments Only
· 3. Transcription Conventions
· 3.1. Characters and Special Symbols
· 3.2. Spelling and Grammar
· 3.2.1. Dialectal Pronunciations
· 3.2.2. Mispronounced Words
· 3.2.3. Non-Standard Usage
· 3.3. Capitalization
· 3.4. Abbreviations
· 3.5. Contractions
· 3.6. Interjections
· 3.7. Individual Spoken Letters
· 3.8. Numbers
· 3.9. Punctuation
· 3.10. Acronyms and Initialisms
· 3.11. Disfluent Speech
· 3.11.1. Stumbled Speech, Repetitions, and Truncated Words
· 3.11.2. Filler Words
· 3.12. Overlapping Speech
· 3.12.1. Conversational Telephony
· 3.12.2. Media
· 3.13. Unintelligible Speech
· 3.14. Non-Target Languages
· 3.15. Non-Speech
· 3.15.1. Non-Speech Noises
· 3.15.2. Silence/Pauses
· 4. Metadata Labelling
· 4.1. Labelling the Transcribed File
· 4.1.1. File-level Values
· 4.1.2. Annotator Information
· 4.2. Labelling Speakers in the Transcribed File
· 5. Appendix A: The Complete Set of Non-Speech Tags and Other Markup Tags

1. Introduction
Transcription is the rendering of an audio signal as a textual representation. This can include
representing speech as well as other sound types, such as phones ringing or music.

2. Segmentation
Segmentation is the process of "timestamping" the audio file for each given speaker. It involves
indicating structural boundaries within an audio file, such as sound types, conversational turns,
utterances, and phrases. Segment boundaries also facilitate the transcription process by allowing
the transcriptionist to listen to manageable chunks of segmented speech at a time.
2.1. Creating Segments
2.1.1. General Segmentation Requirements

· Create segments (i.e., timestamp the audio file) according to the five primary segment types listed
in Section 2.1.2. The five primary types are:
· Speech
· Babble
· Overlap
· Music
· Noise
· Each segment will be timestamped to the millisecond. Timestamps must be positive floating-point
numbers in the format seconds.milliseconds (e.g., 12.345 for 12 seconds and 345 milliseconds).
· Each segment should have only one primary sound type, which will be listed as the primaryType, one
of the segment objects, in the transcription JSON. See Section 2.1.2 for the required sound types and
their requirements.
· Create each segment tightly around its targeted sound type. Leave out continuous stretches of
silence/white noise that last two or more seconds at the beginning, in the middle, or at the end of the
segment.
· Transcription is needed only for Speech segments.
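As a sketch of the requirements above, a single segment record might look like the following. These guidelines name only primaryType; the other key names and the helper function are illustrative assumptions, not a confirmed schema:

```python
# Illustrative sketch only: "primaryType" is named in the guidelines;
# the other key names are assumptions, not a confirmed schema.

PRIMARY_TYPES = {"Speech", "Babble", "Overlap", "Music", "Noise"}

def make_segment(segment_id, start, end, primary_type):
    """Build one timestamped segment; times are seconds.milliseconds floats."""
    if primary_type not in PRIMARY_TYPES:
        raise ValueError(f"unknown primary type: {primary_type}")
    if not (0 <= start < end):
        raise ValueError("timestamps must be positive, with start before end")
    return {
        "segmentId": segment_id,       # assumed key name
        "startTime": round(start, 3),  # millisecond precision
        "endTime": round(end, 3),
        "primaryType": primary_type,   # the one key named in Section 2.1.2
    }

seg = make_segment("001", 0.250, 3.638, "Speech")
```

Per Section 2.3, Speech segments carry additional objects beyond these; other segment types do not.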

2.1.2. Specific Requirements for Each Segment Type


2.1.2.1. Speech
· Create Speech segments for audio signals that consist of speech from one or two intelligible foreground
speakers (i.e., speakers of interest). The speech in a Speech segment needs to be transcribed.
· For conversational telephony containing split-channel speech (i.e., one channel, one foreground
speaker), create segments only for the speech from the foreground speaker on that given channel.
· Don't create Speech segments for overlapping speech that takes place in the background (e.g.
people standing nearby or in the same room talking). See Section 3 Transcription Conventions on how to
transcribe foreground speech that overlaps with background speech.
· For media data containing co-channel speech (i.e., one channel, multiple foreground speakers), create
separate segments for the speech from each foreground speaker.
· If there is intelligible overlapping speech from two foreground speakers (e.g., when two interviewees
are speaking at the same time), create an individual Speech segment for each of the two foreground
speakers (even if one of the foreground speakers might be unintelligible). Each segment must have its
own unique segment ID. See Section 3 Transcription Conventions on how to transcribe segments
involving overlapping foreground speech.
· For ease of segmentation, it is OK for the two individual segments to have the same start time and
end time.
· Don't create Speech segments for overlapping speech (a) between two unintelligible foreground
speakers or (b) between three or more foreground speakers regardless of intelligibility. Create Overlap
segments for these sound types instead.
· Don't create Speech segments for overlapping speech that takes place in the background (e.g. people
talking behind a field reporter reporting in a scene). See Section 3 Transcription Conventions on how to
transcribe foreground speech that overlaps with background speech.
· Segment boundaries should be as natural as possible (e.g., end of a turn, end of a complete sentence,
between phrases, before and after a filled pause). Segment boundaries should never be in the middle of
a word.
· Each segment should consist of speech that forms a natural conversational unit or a linguistic unit (e.g.,
speech belonging to the same conversational turn, sentence, or phrase). One exception: when two
individual Speech segments are created for two overlapping foreground speakers and share the same
start and end time, it is OK if one of these segments consists of speech that doesn't form a natural
conversational or linguistic unit.
· Don’t break up a turn or a sentence into different segments unless it exceeds 15 seconds.
· Due to the preference for segments that are conversationally or linguistically related, a Speech segment
can include occasional silence/white noise or other sound types (e.g., music, noise) as long as each
lasts two seconds or less. See Section 3 Transcription Conventions on how to transcribe segments
involving non-speech noises.
· Each segment should not exceed 15 seconds. Whenever possible, create segments closer to 15 seconds.
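The duration rules above are mechanical enough to check automatically. A minimal sketch, assuming timestamps are plain floats in seconds.milliseconds; boundary naturalness still needs human judgment:

```python
MAX_SEGMENT_SECONDS = 15.0  # from the requirement above

def speech_segment_ok(start, end):
    """Check timestamp ordering and the 15-second cap for one Speech segment."""
    return 0 <= start < end and (end - start) <= MAX_SEGMENT_SECONDS
```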
2.1.2.2. Babble

· Create Babble segments for audio signals that consist of speech or isolated vocal noise (e.g. coughing,
laughing) from one or more background speakers (e.g., people standing nearby or in the same room),
even if the speech is partially intelligible.

2.1.2.3. Overlap

· Create Overlap segments for audio signals that consist of overlapping speech between two or more
unintelligible foreground speakers or between three or more foreground speakers, regardless of
intelligibility. Use this also when there is overlapping speech between two or more speakers but it is
difficult to differentiate between foreground and background speakers.

2.1.2.4. Music

· Create Music segments for audio signals that consist of music, songs, singing, or sounds from musical
instruments. This includes theme songs or characters singing songs.

2.1.2.5. Noise

· Create Noise segments for audio signals that consist of any isolated non-speech noise (e.g., applause,
phone ring).

Notes: The term "foreground speaker(s)", or "speaker(s) of interest", refers to the speaker(s) that a
particular recording is intended to capture. For split-channel conversational telephony (i.e., one speaker,
one channel), the foreground speaker is either the caller/agent or the call-receiver/customer. For co-
channel media data (i.e., one channel, multiple foreground speakers), the foreground speakers will vary
depending on the domain. In a political debate, for example, the range of foreground speaker(s) could
include the host, the debaters, and potentially audience members with questions; in a reality
television show, the foreground speaker(s) would include all of the protagonists featured.
See Section 2.2 below for some segmentation examples.
2.2. Segmentation Examples
The following examples visualize the desired segmentation based on the segmentation requirements
outlined above. Each visualization has six rows:
Row | Description

0 | Audio signals

1 | Start time - End time

2 | Segment ID

3 | Segment Primary Type

4 | Speaker ID

5 | Transcription
Segment boundaries are the blue vertical lines.


2.2.1. Example 1 - Segmenting an Audio File with Split-Channel Conversation Telephony

1. Segmentation is tight around each targeted primary type (i.e., Speech in this example).
2. Long stretches of silence/white noise are left out (e.g., between 3.638 and 8.910 seconds).
3. Each segment is less than 15 seconds.
4. Segment 001 consists solely of unintelligible speech from the foreground speaker. It is still classified as
Speech, and the speech is transcribed with best guesses.
5. Each Speech segment consists of speech that is conversationally or linguistically related.
a. Segment 001 and Segment 002 each consists of a single speaker turn, followed by a pause.
b. Segment 003 consists of a complete sentence. The end of the segment constitutes a sentence
break.
c. Segment 004 consists of another complete sentence, with a 1.5-second pause transcribed as
[no-speech]. The sentence is not broken up into two segments at the pause because that would
have resulted in a segment with speech that is not linguistically or conversationally related (i.e., "#ah,
we're going to talk about #um").

2.2.2. Example 2 - Segmenting a Co-Channel Media File


1. Segmentation is tight around each targeted primary type (e.g. Speech, Music).
2. The media file consists of multiple speakers. Each segment consists of transcribed speech from a single
speaker. Segment 00001 consists of speech from "m_0001", Segment 00002 consists of speech from
"f_0001", and Segments 00004-00006 consist of speech from "Vinny".
3. Segment 00003 consists solely of music and is therefore given Music as its primaryType. No
speaker ID, language, or transcription is needed.
4. Segment 00005 consists of speech with music playing in the background. When the speech stops, the
background music continues for more than one second and is transcribed with the [music] tag.
5. Some other Speech segments (e.g., 00004) consist of speech with music playing in the background. The
speech is transcribed without the use of the [music] tag.
6. The continuous stretch of speech from 14.054-33.563 is divided into two segments, Segments 00004
and 00005, because otherwise the segment would be over 15 seconds long. The division takes place at
a sentence break (i.e., at 22.239).
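The division described in point 6 can be sketched as a greedy rule: while the remaining stretch exceeds 15 seconds, cut at the latest sentence break that keeps the current chunk within the limit. The function and its interface are illustrative assumptions, not part of the guidelines:

```python
def split_at_sentence_breaks(start, end, breaks, max_len=15.0):
    """Pick cut points from `breaks` (sorted timestamps inside the stretch)
    so that every resulting chunk is at most `max_len` seconds long."""
    points = []
    cur = start
    while end - cur > max_len:
        candidates = [b for b in breaks if cur < b <= cur + max_len]
        if not candidates:
            raise ValueError("no sentence break within the 15-second window")
        cur = max(candidates)  # latest break that still fits
        points.append(cur)
    return points

# The stretch from Example 2: 19.509 seconds, one sentence break at 22.239.
cuts = split_at_sentence_breaks(14.054, 33.563, [22.239])  # -> [22.239]
```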

2.3. Labelling Segments


Each segment must contain the list of segment objects in the tables below. Some objects must be
present and filled regardless of the primary type of a segment. Other objects must be present and
filled for Speech segments only and excluded from other segment types.

2.3.1. All Segments


For all segment types, the following objects must be present and filled:
Segment Object | Description

Start time | Start timestamp of the segment in the format of seconds.milliseconds.

End time | End timestamp of the segment in the format of seconds.milliseconds.

Segment ID | A string that uniquely identifies the segment.
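A completeness check over the objects listed above could look like the sketch below; the key names are assumptions, since the guidelines list these objects only by their prose names:

```python
# Assumed JSON key names for the objects required in all segment types.
REQUIRED_ALL_SEGMENTS = {"startTime", "endTime", "segmentId"}

def missing_required(segment):
    """Return, sorted, the required objects absent from a segment dict."""
    return sorted(REQUIRED_ALL_SEGMENTS - segment.keys())
```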
