Nidoking Segment-Level Annotation

Created Sept 18, 2025


Last Updated:
Oct 5, 2025 - all updates highlighted green
Sept 29, 2025
Sept 22, 2025

1. Overview
In this task, you will be working with a conversation annotation tool that shows audio segments
taken from a larger conversation. Your task will be to review and correct inaccuracies in pre-
existing transcribed segments, including segment timestamps, speaker, turn number, emotion,
and locale/accent.

Unit Workflow Overview


1. Listen to the entire audio segment and verify/correct the number of regions, region timestamps, and turn types.
2. Review each region’s emotion and make corrections as needed.
3. Review the pre-filled transcription of the audio, making corrections where necessary.
4. Insert tags where needed.
5. Complete the Locale Identification for each Speaker in the Segment.

Key Terms
Unit: Corresponds to one page in the ADAP job.
Segment: A segment of audio with more than one speaker. This is what you’ll see in Waveform
at the top of each unit. Each unit contains one segment.
Region: A portion of the Segment that contains speech or non-speech noise. Generally
corresponds to a turn in the conversation.

2. Tool Overview
Audio Player & Controls
- Waveform: Visual representation of the audio with colored segments
- Zoom/Speed sliders: Adjust playback for better listening
- Play/Pause: Play and pause the audio
- + Region button: Add a new region for missed speech that needs its own region
- Region navigation: Click and drag red segments in the Waveform visual to alter region
coverage.



Regions Table
List of regions with pre-populated content to be checked: Type, Emotion, Transcript
Use the Show only dropdown to show only one Type in the Regions table.

Field Descriptions
- Time: The timestamp range for the region (modified by adjusting the regions in the
Waveform table at the top of the unit).
- Type: Speaker Identification (one speaker per turn)
a. You may also choose Noise if the region is for a non-speaker noise.
- Emotion: Overall emotional tone of the region’s audio content.
- Transcript: Pre-populated transcription of the audio in a region.
- Actions: Tools specific to the region.
a. Search: Auto zoom in on the region within the Waveform table.
b. Play: Play the audio for the region.
c. Hide: Hide the region’s highlight in the Waveform table.
This is helpful when creating a region that overlaps with another region.
d. AI Transcribe: Generate a fresh AI transcription in the Transcript text box.
e. Delete: Remove a region from the Regions table.



3. Workflow Instructions
1. Managing Regions
Upon starting a new unit, the first thing to do is listen to the audio and verify that the regions are accurate. This step should be completed simultaneously with Step 2 (Type Field). If there are any errors with the regions, they must be corrected in these first two steps.

Region Editing General Instructions

- Each speaker turn and each non-speaker noise should have its own region, although some short, overlapping speaker noises can be skipped. See below for more details on non-speaker noise regions.
- Ensure regions have accurate start and end times. If in doubt where to start or end a region, opt to extend the start or end time rather than clipping it shorter.
- Avoid having overly long segments. If a region goes beyond 25 seconds long, split
the region at a reasonable interval in the speaker turn (e.g. at a pause or change of
topic).
- Regions should never include pauses that are longer than 2 seconds. If a speaker turn (region) contains a pause longer than 2 seconds, split the region and exclude the pause, even if the entire region does not exceed 25 seconds. (A minimal sketch of these duration and pause checks follows this list.)
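These duration and pause rules are mechanical enough to sanity-check in code. Below is a minimal, hypothetical Python sketch of the two checks; the Region structure and field names are assumptions for illustration, not part of the annotation tool.

```python
from dataclasses import dataclass

MAX_REGION_SECS = 25.0  # regions longer than this must be split
MAX_PAUSE_SECS = 2.0    # pauses longer than this must be excluded

@dataclass
class Region:
    start: float   # region start time, in seconds (hypothetical field)
    end: float     # region end time, in seconds (hypothetical field)
    pauses: list   # (start, end) pairs for pauses inside the region

def region_problems(region: Region) -> list:
    """Return rule violations for a single region, per the rules above."""
    problems = []
    if region.end - region.start > MAX_REGION_SECS:
        problems.append("over 25 seconds: split at a pause or topic change")
    for p_start, p_end in region.pauses:
        if p_end - p_start > MAX_PAUSE_SECS:
            problems.append(
                f"{p_end - p_start:.1f}s pause at {p_start:.1f}s: "
                "split the region and exclude the pause"
            )
    return problems

# A 30-second turn containing a 3-second pause violates both rules.
print(region_problems(Region(start=0.0, end=30.0, pauses=[(12.0, 15.0)])))
```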

How to Edit Regions

Hover over the Waveform chart and drag highlighted regions in order to change timestamps in
the Regions table. When creating a new region, click on the Waveform chart where you’d like
the region to start and click +Region.

- Adjust boundaries: Make sure region timestamps accurately capture the speech for the
selected region.
- Expand regions: If there is unhighlighted speech that belongs to an already highlighted
region, expand the region's edge to encompass the correct audio for the selected region.
- Contract Regions: If there is highlighted speech that belongs to a different region or
needs its own region, contract the region’s edge to encompass the correct audio for the
selected region.
- Add regions: Add new regions when there is audio that does not belong to an already
created region.
- Remove regions: Delete regions that don't contain speech or noise.

Speaker Noises
Noises made by one or more of the Speakers (e.g. coughing, laughing, burping, etc.) should be
marked with a tag within the text of the transcription (see Section 5, Tags, for more information on tags). Speaker noises tagged within the transcription of speech should be noises made by the speaker for that region; i.e., speaker noises must be tagged within the region of the speaker who is making the noise.

Examples:
- If speaker 1 makes a brief cough while speaker 2 is speaking, do not annotate.
- If speaker 2 makes a brief cough in the middle of their own dialogue, tag it within the
dialogue.
- Speaker 1: Then you have to put the flour [cough] in the bowl.
o The [cough] should be from Speaker 1
o If the [cough] is brief and from Speaker 2, do not annotate.

Non-Speaker Noise Regions


A non-speaker noise is a noise that is made by something other than a speaker in the
conversation (e.g. ringing, paper crinkling, WhatsApp chime).

Non-speaker noises MUST always get their own region; you should select Region Type
Noise for non-speaker noise regions.
- Noise regions can, and often will, overlap with Speaker regions.
o For Example: A continuous knocking sound in the background while a speaker is
talking would get its own region, covering the entire length of the knocking (while
adhering to the max 25 secs/region rule), and the transcription would be only the
tag [knocking].
o For Example: There is music playing in the background of the entire
conversation. You must include a separate noise region that spans the
length of the non-speaker noise (in this example, the entire span where the
music is playing), even if this means there’s a separate noise region that spans
the entire conversation.

DO NOT LEAVE OUT NON-SPEAKER NOISE REGIONS. Non-speaker noises MUST have their own regions in the Regions table. There should never be a stretch of non-speaker noise that is not covered by a noise region.

- If a non-repetitive, non-speaker noise occurs that is very short and hard to capture in a
region, you can include the noise within the Speaker region as a tag. You do not need to
create separate regions for very short (~<1 sec) non-speaker noises.
o For Example: A speaker is talking, and one quick knock occurs in the middle of
the speaker turn. The knock is very short and hard to capture in the Waveform
table due to the small size of the region.
▪ This is what I heard yester- [knock] -day after I came into work.

2. Type Field



The Type field can either be used for Speaker Identification or to mark a non-speaker noise.
Review and correct any issues in the Type Field for each region. This step should be
completed simultaneously with Step 1.

Field Type General Instructions

- Speaker identifiers should be used for the same speaker throughout a segment’s
regions.
- Each region should only have one Speaker Identifier, or it should be a Noise type.
- If in doubt about who a speaker is, do your best to quickly guess which speaker it is based on the voice and the flow of the conversation. Sometimes an ambiguous part of the audio makes it impossible to choose the most likely speaker; the best route in these challenging cases is to annotate the region using a new speaker identifier. Do not spend too much time trying to decipher challenging region types. The most important thing is indicating that the speaker has changed.
- If there's more than one speaker, there must be more than one speaker type.
- When in doubt about how to deal with a Speaker Type, opt for more Speaker Types over fewer (e.g. opt for a new Speaker Type over one already used).
a. It’s better to have one Speaker with more than one Speaker Type associated with them than two different Speakers labeled with the same Speaker Type.
- Speaker Impersonating: When a speaker is impersonating different characters’
speech, each character should be marked with a different Speaker Type.
a. For Example:
The transcript has one or more Speakers telling a narrative. When the Speaker performs dialogue in the narrative, they change which character they are impersonating depending on whose dialogue they are narrating (Speakers often also change their voice). In cases like this, each character in the narrative (including the narrator) should get a different Speaker Type & Region.

Overlapping Speaker Turns

- Regions can fully or partially overlap.


- When two speakers overlap, do your best to have regions that contain the entire
speech act of both speakers.
- Overlapping speech should not be ignored; overlapping speech must be
annotated within its own region.
- Similarly, when a Speaker and a Noise overlap, create separate regions that contain the entire speech in one and the entirety of the non-speaker noise in the other.
- When annotating overlapping turns, use the “Show only” dropdown at the top of the Regions table to show only the speaker you are working on.
- Segments from the same speaker should never overlap. You will receive an error for regions that have same-speaker overlap (a minimal check is sketched at the end of this section).

For Example:



Consider the longer Speaker 1 Region here, starting with “Hi Travis…”

Speaker 2 interrupts Speaker 1 in the middle of this region, but Speaker 1 does not stop talking, resulting in Speaker 2 overlapping with Speaker 1’s speech. After annotating Speaker 1’s turn, change to ‘Show only: Speaker 1’ and annotate a separate region containing Speaker 2’s overlapping speech.

Switching back to ‘All types’, the overlapping regions can be seen.

An overlapping noise with a speaker, or two noises that overlap with each other, would be
handled the same way as two overlapping speakers. Remember, ALL overlapping speech
and noises that are >=1 second long must be annotated.
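Since the tool flags same-speaker overlaps automatically, the rule can be expressed as a simple check. The sketch below is illustrative only; the (speaker, start, end) tuple layout is an assumed stand-in for the Regions table, not the tool’s actual data format.

```python
def find_same_speaker_overlaps(regions):
    """Flag pairs of regions that share a Speaker Type and overlap in time."""
    by_speaker = {}
    for speaker, start, end in regions:
        by_speaker.setdefault(speaker, []).append((start, end))
    overlaps = []
    for speaker, spans in by_speaker.items():
        spans.sort()
        for i, (s1, e1) in enumerate(spans):
            for s2, e2 in spans[i + 1:]:
                if s2 < e1:  # later span starts before this one ends
                    overlaps.append((speaker, (s1, e1), (s2, e2)))
    return overlaps

regions = [
    ("Speaker 1", 0.0, 10.0),
    ("Speaker 2", 6.0, 12.0),  # allowed: different speakers may overlap
    ("Speaker 1", 9.0, 14.0),  # error: Speaker 1 overlaps Speaker 1
]
print(find_same_speaker_overlaps(regions))  # reports only the Speaker 1 pair
```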



3. Emotions Field

Emotion annotation requires you to label speech regions with emotional categories.

Emotions Field General Instructions

- You should only label a region with an emotional category if the emotion is clear
and obvious. Otherwise, default to Neutral. Do not take more than a few seconds
to decide emotions. You should know rather quickly if an emotion is apparent in
the speech or not. Changing from Neutral should be obvious and not involve deep
analysis or back-and-forth. If you aren’t sure, then leave it Neutral.
- Emotions will appear in a dropdown menu in the target language for the job.
- Determine the emotional category of a region based on vocal features in the audio (i.e.
how the speaker is saying it).
- Vocal features may include tone, pitch, intensity, emphasis of certain words, etc.
- Don’t make assumptions based on word choice or the topic of the conversation; only
analyze Emotions based on vocal features.
- Most regions will naturally not have any emotions; these should be annotated with
Neutral.

Remember, emotions analysis should focus on the delivery of the speech, not the
content.

List of Emotions

- Neutral (default)
- Sad
- Happy
- Angry
- Fearful
- Disgusted
- Surprised
- Confident
- Nervous
- Loving
- Bored
- Other

Choosing ‘Other’ Emotion

If the segment expresses an emotion that is not on the dropdown list, choose “other” and type in
the emotion that best describes the segment speech.



- Emotions that are typed into the text box must be in the target language of the job.
- If the target language has gendered adjectives, use the default version (usually
masculine or neuter).
- Emotions typed into the text box must be concise and no more than one to three words long (a quick length check is sketched after this list).
- Don’t be repetitive, e.g. don’t put something like ‘furiously angry’; instead use ‘furious’ or ‘very angry’.
- Be sure you use correct spelling when typing in text boxes.
- Don’t use slang that isn’t common or that most native speakers of the language wouldn’t
understand/use.
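The length constraint above is easy to state as a check. A minimal sketch; the function name is a hypothetical illustration, not part of the tool:

```python
def valid_other_emotion(text: str) -> bool:
    """'Other' entries must be concise: one to three words."""
    word_count = len(text.strip().split())
    return 1 <= word_count <= 3

print(valid_other_emotion("furious"))              # True
print(valid_other_emotion("very angry"))           # True
print(valid_other_emotion("angry in a loud way"))  # False: too long
```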

4. Transcription Field

Transcription Field General Instructions



- Words should be annotated exactly as spoken (e.g. d- d- did you see the um game last night?)
a. Annotate all stutters, partial and repeated words, disfluencies, etc.
• For Example:
And wh- what did they tell tell you?

- For ambiguous-sounding words where you need to guess, surround the word with hashtags.
a. For Example:
I like your #blue#
b. If a word is completely unintelligible, use the [unintelligible] tag (see below in
Section 5 Tags for more information).

- Transcriptions should use standard capitalization and punctuation for the given
language, e.g. commas, question marks, period, exclamation marks, quotes, etc.
a. Exclamation points supersede question marks when both would be appropriate.

- Emphasis and prolonged vowels should only be annotated when they are clearly and
obviously present in the speech. When in doubt, do not include.
a. When a speaker emphasizes a word while speaking, capitalize all the letters in
the word and surround it with three asterisks on each side.
• For Example:
I cannot ***BELIEVE*** she said that!
b. Prolonged vowels should be marked on either side with an asterisk.
• For Example: She said th*a*t?

- Numbers should be written out as spoken.


For Example:
a. CORRECT: The time is seven thirty.
NOT CORRECT: The time is 7:30.
b. CORRECT: It happened in twenty twenty five.
NOT CORRECT: It happened in 2025.
c. CORRECT: The total was twelve ninety-nine.
NOT CORRECT: The total was $12.99.

- Special Characters should be written out as they are pronounced. DO NOT use a
symbol to represent a word.
For Example:
a. CORRECT: She gave me seven percent.
NOT CORRECT: She gave me 7%.
b. CORRECT: The party was fun slash weird.
NOT CORRECT: The party was fun/weird.



- Abbreviations should be written out the way they are pronounced.
For Example:
a. CORRECT: Mister Johnson
NOT CORRECT: Mr. Johnson
b. CORRECT: Doctor Smith
NOT CORRECT: Dr. Smith
c. CORRECT: Elisabeth Street
NOT CORRECT: Elisabeth St.

- Acronyms are words made up of the first letters of other words and are pronounced as
a word. Acronyms should be written with all capital letters, no spaces.
a. CORRECT: NASA
b. CORRECT: NATO

- Initialisms and words that are verbally spelled out should be spelled using capital letters joined by underscores (a small formatting helper is sketched below).
a. CORRECT: I_B_M
b. CORRECT: U_S
c. CORRECT: J_A_M_I_E
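The underscore-joining convention is mechanical, so a tiny helper can illustrate it. This is a hypothetical sketch, not a tool feature:

```python
def format_initialism(letters: str) -> str:
    """Join individually pronounced letters with underscores: 'ibm' -> 'I_B_M'."""
    return "_".join(letters.upper())

print(format_initialism("ibm"))    # I_B_M
print(format_initialism("us"))     # U_S
print(format_initialism("jamie"))  # J_A_M_I_E
```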

5. Tags

Tags General Instructions

- All non-speech sounds should be transcribed with a Tag in the target language of the
job.
a. FOR EXAMPLE:
[cough] --> French [toux] / Spanish [tos] / German [husten]
- Non-speech sounds that are from the speaker (e.g. the speaker coughs, burps, sniffs,
etc.) should be labeled with a tag within the speech transcription of the speaker who
makes the sound.
a. FOR EXAMPLE:
Speaker 1: They’re not [sniff] not sure what to do [sniff].
- Non-speaker sounds should get their own region and be labeled with Type Noise. The transcription for Noise regions should be a tag that represents the non-speaker noise (e.g. [water running], [dog barking], etc.)
- Tags must always be put between brackets, e.g. [sneeze]. (A minimal localization lookup is sketched below.)
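Since every tag must be localized into the job’s target language, localization can be pictured as a lookup from a canonical tag to its target-language form. In this minimal sketch, the [cough] translations come from the example above; the [laugh] row and all names are illustrative assumptions:

```python
# Canonical English tag -> localized form per target language.
TAG_TRANSLATIONS = {
    "cough": {"fr": "toux", "es": "tos", "de": "husten"},   # from the example above
    "laugh": {"fr": "rire", "es": "risa", "de": "lachen"},  # assumed translations
}

def localize_tag(tag: str, language: str) -> str:
    """Return the bracketed, localized form of a canonical tag."""
    return f"[{TAG_TRANSLATIONS[tag][language]}]"

print(localize_tag("cough", "es"))  # [tos]
print(localize_tag("laugh", "fr"))  # [rire]
```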

List of Tags

All tags should be in the language you are working in, e.g. if you are working in Spanish, all tags should be localized into Spanish. The below lists of tags are not exhaustive. If there are sounds, coming from the speaker or from a non-speaker object, that are not covered by the below list, you should create a tag that better fits the sound. Always put tags between brackets and localize them into the language of the transcription.

While tags should be short and concise, they should be as descriptive as possible (e.g. WhatsApp chime, nails on chalkboard, etc.)

Before submitting, confirm that ALL tags are consistently translated into the target
language of the transcript. This includes the [unintelligible] and foreign language tags.

List 1: Unclear Language


Use the following tags when words or phrases are hard to hear or understand.
Tag: [unintelligible]
Description: Replace words or phrases that cannot be guessed due to unintelligibility with the [unintelligible] tag.
Example: I really felt very [unintelligible] came by the house.

List 2: Speaker Sounds


Annotate within the region of the speaker who makes the sound.
Tag Description
[foreign] / [Spanish] / [Hindi] / etc.
• If a word or phrase is in a language other than the target language of the job, and you cannot identify the language spoken, replace the word/phrase with the [foreign] tag.
• If you can identify the language spoken, replace the [foreign] tag with a tag indicating the language being spoken, e.g. [French].
• Language tags should be written in the language of the transcript, e.g. in a German job, Arabic speech should get the tag [Arabisch] (“Arabic” in German).
• If the foreign language spoken is English, do not use a tag; instead, English speech should be transcribed.
[laugh] a speaker laughs
[clears throat] a speaker clears their throat
[sigh] a speaker sighs
[snort] a speaker snorts
[cry] a speaker cries
[lip smack] the sound of a lip smack
[breath] a noticeable speaker breath
[cough] a speaker coughs

List 3: Non-Speaker Sounds


Use Noise Region
Tag Description



[click] a click noise
[ring] a ring noise
[knock] a knock sound
[bang] a bang sound
[dial tone] a dial tone
[distant background speech] distant background speech
[distant background noise] an unidentified distant background noise
[unidentified noise] an unidentified noise captured in close proximity to the speaker (not ‘distant’)
[echo] an echo
[car engine] the sound of a car engine
[static] the sound of static

6. Locale Identification

The Locale Identification section is required for all Speakers. Pre-filled locales should be checked for accuracy. Do your best to correctly identify each speaker’s locale/dialect. Accent Descriptions are optional; fill them in only if you are able to confidently give additional, more specific information about the dialect.

List of Locales
Only use the locale codes listed below in the “Locale” text box. Use only locale codes from your locale (e.g. if you are working in an English job, you should use English locale codes). Locale codes should be written as four letters joined by an underscore, and they must be written exactly as they appear in this list under “Locale Code” (a strict membership check is sketched after the table).
Locale Locale Code
Czech – Czech Republic cs_CZ
English – United States en_US
English – Great Britain en_GB
English – Canada en_CA
English – Australia en_AU
English – New Zealand en_NZ
English – Ireland en_IE
English – South Africa en_ZA
French – France fr_FR
French – Canada fr_CA
French – Belgium fr_BE
French – Switzerland fr_CH
German – Germany de_DE
German – Austria de_AT
German – Switzerland de_CH
Hungarian – Hungary hu_HU



Italian – Italy it_IT
Italian – Switzerland it_CH
Portuguese – Portugal pt_PT
Portuguese – Brazil pt_BR
Romanian – Romania ro_RO
Romanian – Moldova ro_MD
Spanish – Spain es_ES
Spanish – Mexico es_MX
Spanish – Argentina es_AR
Spanish – Colombia es_CO
Spanish – Chile es_CL
Spanish – Peru es_PE
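Because codes must match the table exactly, the requirement amounts to a strict membership check. A minimal sketch whose set contents mirror the table above:

```python
# Exact locale codes accepted in the "Locale" text box (from the table above).
VALID_LOCALES = {
    "cs_CZ", "en_US", "en_GB", "en_CA", "en_AU", "en_NZ", "en_IE", "en_ZA",
    "fr_FR", "fr_CA", "fr_BE", "fr_CH", "de_DE", "de_AT", "de_CH", "hu_HU",
    "it_IT", "it_CH", "pt_PT", "pt_BR", "ro_RO", "ro_MD",
    "es_ES", "es_MX", "es_AR", "es_CO", "es_CL", "es_PE",
}

def is_valid_locale(code: str) -> bool:
    """Codes must match the list exactly, underscore included."""
    return code in VALID_LOCALES

print(is_valid_locale("en_GB"))  # True
print(is_valid_locale("en-GB"))  # False: hyphenated forms are not accepted
```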

Examples:

All the below examples are correct based on the knowledge of the example annotator.

In this first example, the annotator identified both speakers as having a US English accent. They were also able to identify the region of the country the accent was from. This level of detail is desired but not required. While Locales are required, please only give details in the Accent Identification box if you feel fairly certain of your analysis.

In the second example, the annotator identified both speakers as speaking Spanish from Spain,
but they did not feel confident identifying the specific regional accent, so they left the accent
description boxes empty.



4. Quality Tips
- Listen to the entire audio first to understand the full conversation flow.
- Region management first! Always verify and correct regions before annotating
anything else.
- Use playback controls (zoom/speed) if audio is unclear.
- Be consistent with your classification choices across similar segments.
- Default to Neutral when unsure if emotion is expressed in the region.
- Focus on primary characteristics - choose the most prominent emotion and only if the
emotion is clearly demonstrated in the region’s speech (tone, emphasis, etc.).
- Work systematically from top to bottom of the regions table in each step.
- Only input dialect/accent information if you feel fairly certain of your conclusion. It’s
okay to leave the dialect analysis box fully or partially empty. Fill it in as much as you are
able with ~80% confidence.

