Ontology 2023

Audio Annotation
Ontology
The purpose of the task is to annotate audio data with the labels and attributes
defined below. We will use this data to enhance speech-processing machine-learning
models, so that our software can remove unwanted noises during audio calls
and video conferences. More specifically, our software will help to enhance speech
quality and improve digital communication.
To complete this activity, you will use our ontology to understand how to classify
and label audio files. This ontology will be your principal source of information.
Our ontology is organized as follows:
1. A main type (MT): speech, noise, or mixed.
2. A subtype (ST): depending on the MT.
3. Tags: one or more, or optional, depending on the ST.
4. Reverberation: low, med, or high.
Please follow the instructions and study each case carefully.
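If you keep your labels in a script or spreadsheet export, the four-part structure above can be sketched as a small record. This is only an illustration: the class name Annotation, its field names, and the label() helper are assumptions and are not part of the annotation application. The label format follows the examples used in this material, such as noise-silence-some baby crying.

```python
from dataclasses import dataclass, field
from typing import List

# Values taken from the ontology outline above.
MAIN_TYPES = {"speech", "noise", "mixed"}
REVERB_LEVELS = {"low", "med", "high"}


@dataclass
class Annotation:
    """Hypothetical record for one annotated audio file."""
    main_type: str                      # MT: speech, noise, or mixed
    subtype: str                        # ST: depends on the MT
    reverb: str                         # low, med, or high
    tags: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject values that are not part of the ontology outline.
        if self.main_type not in MAIN_TYPES:
            raise ValueError(f"unknown main type: {self.main_type}")
        if self.reverb not in REVERB_LEVELS:
            raise ValueError(f"unknown reverberation level: {self.reverb}")

    def label(self) -> str:
        """Join MT, ST, and tags the way the examples in this material do."""
        parts = [self.main_type, self.subtype]
        if self.tags:
            parts.append(", ".join(self.tags))
        return "-".join(parts)


example = Annotation("noise", "silence", "low", ["some baby crying"])
print(example.label())
```

The reverberation level is stored separately because it is selected with its own button and does not appear in the label text.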
Speech
______________________________________________________________________
Speech: audio files containing clean speech in any language, with only one person
speaking at a time. Clean speech means the file contains no other sounds, regardless of
their duration and level.
This main type does not include audio files with overlapping speech/voices,
that is, files in which more than one person speaks at the same time.
The subtypes included in this MT are:
1. Speech-clean high pitched: someone speaking with a high-pitched voice. We
include reading, people having conversations, speeches, etc. Keep in mind that it
must be one speaker at a time.
Tags: male, female + language + emotions
2. Speech-clean read: you can identify that the speaker is reading. We include
scripted speeches, storytellers, news reporters, etc. One reader at a time.
Tags: male, female + language + emotions
3. Speech-conversational/animated: different from reading, this subtype refers to a
fluent and spontaneous conversation or monologue. It expresses natural emotions.
One speaker at a time.
Tags: male, female + language + emotions
4. Speech-children: it only includes children speaking spontaneously or reading. Use
the corresponding subtype for the other cases (singing, whispering, laughing, etc.).
One speaker at a time.
Tags: language + emotions
5. Speech-acapella singing: someone singing without music or sounds. One singer
at a time.
This subtype also covers non-lexical singing or vocalization, which refers to singing
nonsense syllables such as "la la la," "na na na," or "da da da" (scat singing in vocal
jazz). These nonsense syllables represent singing. Annotate them as
speech-acapella singing only if the audio file is clean and does not include music or
any other sound.
To learn more about this topic, please watch the following video:
Exploring Jazz Vocals and Scat Singing
Tags: male, female, children + language + emotions
6. Speech-strong reverb: a clean speech audio file (a conversation, reading, or
singing without sounds) with a very strong reverberation. The reverb is real and
natural (not fake, and not an echo where you can hear a distinct repetition of words or
sounds one or more times). It includes only one speaker at a time. Mark these files
as high.
Tags: male, female, children + language + emotions
7. Speech-whispering: someone whispering (a reading, conversation, or
monologue). One speaker at a time.
Tags: male, female, children + language + emotions
8. Speech-laughing: a completely clean file (without any kind of sounds). The file
contains natural or fake laughter, exaggerated or not. One person laughing at a
time. Consider that if the file includes laughter and other sounds, it must be
annotated as noise (you will find more information in the noise section).
Tags: male, female, children + emotions
9. Speech-unintelligible vocal sounds: a clean file that contains vocal sounds
produced by humans that are not considered speech, such as the imitation of
animals, ghosts, creatures, and cartoons. Only one person at a time.
Tags: male, female, or children
10. Speech-other: a completely clean file that matches either of these two cases:
● Any other clean speech example you cannot include in the previous options.
● A mixture of two different speech subtypes in which neither predominates.
One example would be conversational/animated and whispering: in half of the
audio file a person speaks spontaneously (conversational/animated),
and in the other half the person whispers (whispering).
Tags: explain
For example, speech-other-conversational, whispering
As we mentioned above, some subtypes require you to add tags to identify
the language and emotions you find in the audio file. For these cases, use the
following information as a reference:
● Language: you will find audio files in any language. If you identify it, please add
the tag.
Tags: english, spanish, french, italian, etc.
● Emotions: add the tag to identify the emotion expressed in the audio file.
Tags: angry, fearful, sad, excited, happy, disgusted, bored, mad, spontaneous,
disappointed, enthusiastic, neutral
Noise
______________________________________________________________________
Noise: an audio file containing sounds of any kind, duration, or level. It does not include
intelligible speech (words or parts of them) regardless of the language, duration, or level.
The subtypes included in this MT are:
1. Noise-human (non-speech): this subtype includes the natural sounds humans
produce, such as coughing, breathing, whistling, burping, baby babbling, gargling,
footsteps, humming, etc.
Do not consider this subtype in cases where other characters (animals,
ghosts, creatures, cartoons, etc.) produce these sounds. Use the subtype "other" or
the ones related to animals, as appropriate.
Tags: breathing, snoring, coughing, yawning, burping, farting, whistling, sneezing,
humming, footsteps, barfing, baby babble, throat clear, gargle
2. Noise-clapping/hands: it contains clapping/applause (no cheering or whistling) or
finger-snapping sounds. One or more people are involved.
Tags: —
3. Noise-crying: this subtype represents humans crying for any reason (natural or
fake). Do not consider this ST in cases where other characters (animals, ghosts,
creatures, cartoons, etc.) produce this sound. Use the ST "other" if you hear a
ghost, creature, or cartoon crying.
Tags: baby, male, female, children, child, group
4. Noise-laughing: this subtype represents humans laughing (natural or fake
laughter). It covers small or big groups laughing, as well as a single person
laughing when the file also contains other sounds. Do not consider
this ST in cases where other characters (animals, ghosts, creatures, cartoons, etc.)
produce this sound. Use the ST "other" if you hear a ghost, creature, or cartoon
laughing.
Remember the case of speech-laughing. We mark laughter as speech only
when the file is clean (it does not include other sounds).
Tags: group, crowd, female, male, child, children
5. Noise-screaming: it represents humans screaming for any reason. Do not
consider this ST in cases where other characters (animals, ghosts, creatures,
cartoons, etc.) produce this sound. Use the ST "other" if you hear a ghost, creature,
or cartoon screaming.
It is essential to highlight that the audio file must not contain speech. The
screams must not include words. If they do, you must annotate the file as
mixed-other.
Tags: baby, kids, adults, horror
6. Noise-eating/digestive: it contains the sounds people make when eating or
drinking.
Tags: crunchy, drinking, chewing, swallowing, slurping
7. Noise-food packaging: this subtype represents the sounds a person produces
when manipulating food packaging (bags, cans, or bottles).
Tags: opening bags, opening bottles, squashing bottles, crushing cans, crushing
food bags
8. Noise-alarm/siren/horn: this subtype represents all kinds of alarm, siren, and horn
sounds. The alarm, siren, or horn sound must predominate in the audio file. If not,
determine if you need to use another subtype such as "indoor non-intelligible
noise", "outdoor non-intelligible noise", "vehicle", etc.
Tags: alarm, siren, horn
9. Noise-bell: this subtype represents all kinds of bells, ringtones, polyphonic sounds
(cell phones), or chimes. The bell sound must predominate in the audio file. If not,
determine if you need to use another subtype such as indoor or outdoor
non-intelligible noise, "domestic sounds/home sounds," etc.
Tags: —
10. Noise-domestic animals/pets: sounds produced by domestic animals such as
dogs and cats. This subtype includes sounds produced with their mouth and limbs.
Do not specify the type of sound the animal produces. The only exception is
barking; please add the tag "dog barking" to label this specific sound.
This ST does not include birds (rooster, duck, hen, chicken, owl, eagle, crow,
or any other).
Tags: dog, dog barking, cat
11. Noise-livestock/farm animals: sounds produced by farm animals such as pigs,
horses, goats, cows, bulls, sheep, and donkeys. This subtype includes sounds
produced with their mouth and limbs.
This ST does not include birds (rooster, duck, hen, chicken, owl, eagle, crow,
or any other).
Tags: pig, horse, goats, cow, bull, sheep, donkey
12. Noise-wild animals: sounds produced by wild animals such as birds, rats, snakes,
insects, flies, bees, crickets, wasps, toads, monkeys, elephants, lions, tigers, bears,
dolphins, and whales, among others. This subtype includes sounds produced with
their mouth and limbs.
● Type of birds: rooster, duck, hen, chicken, owl, eagle, crow, etc. You do not
need to specify the type of bird in the tag. Label all the types of birds as
"birds" and continue with the following file.
Tags: birds, rat, snake, insects, bee, fly, crickets, elephant, wasp, frog, monkey,
dolphin, lion, tiger, bear, wolf, hyena, elk, sea lion, bat, whale
13. Noise-domestic sounds/home sounds: this subtype represents the sounds
produced in a house and all its environments (kitchen, bedroom, living room, dining
room, bathroom, basement, doors, etc.).
For kitchen sounds, please add the tag "kitchen" and
identify the object you hear, e.g., oven, dishes, knives, frying, chopping, mixer, etc.
This ST includes all kinds of door sounds (including car, metal, sliding doors,
etc.).
A kitchen or bathroom usually produces a mixture of
sounds like water or liquid, a hair dryer, a washing machine, doors, frying, plates,
etc. Use this ST to represent that ambiance. If the water or liquid sound
predominates, use the ST "liquid/water" and add the corresponding "some X" tags.
Exception: this subtype does not include a slamming door sound.
Because we treat it as an impact, we represent it in the "generic impact sounds" subtype.
Tags: vacuum, bathroom, doors, hinge, washing machine, keys, kitchen +
microwave, plates, cutlery, chopping, etc.
14. Noise-vehicle: this subtype represents the sounds of vehicles, their cabins, or their
parts, except their engines, motors, or doors. The sounds of these specific parts
correspond to the subtypes "engine" (engines and motors) and "domestic
sounds/home sounds" (all kinds of doors).
In addition to automotive vehicles, this subtype includes non-automotive ones (used for
transporting someone or something), such as skateboards, stretchers, bicycles, etc.
When annotating a file containing cabin sounds such as train, airplane, or
car cabins, please add the corresponding and complete tag to identify these
specific sounds.
Tags: train cabin, airplane cabin, car cabin, car, airplane, motorcycle, bike, horse
carriage, train, cart, skateboard, stretcher, scooter, tractor, truck, bus, helicopter,
carriage, boat, ambulance
15. Noise-engine: this subtype represents the sound of engines and motors (of any
kind) without other vehicle sounds.
This subtype includes the following sounds:
● Roaring: accelerating a vehicle and making a high-rev engine noise.
● Low rev: the sound of a low-rev vehicle engine when it is not moving (idling
engine). Please watch these videos as a reference:
○ Lowest Revving Gasoline Engines
○ Lowest Revving Gasoline Engines
● Motors: the regular motor sounds. The motor roars, but not too loudly.
Tags: roaring, low rev, motors
16. Noise-group babble: the sounds a group (two or more people) or crowd produces
when speaking simultaneously, e.g., at a theater, airport, food court, etc. Nothing of
what the speakers say can be understood (no words or parts of them regardless of
the language, duration, or level), only the babble sounds. Therefore, if an audio file
contains babble sounds, but you also hear a word or part of it, mark the file as
mixed-other. Remember that the mixture of noise and speech results in a mixed file.
Tags: —
17. Noise-group cheering: the sounds a group (two or more people) or crowd
produces when cheering (expressing joy or support) or booing (showing
displeasure).
This subtype represents a mixture of sounds, e.g., people shouting/yelling,
applause, whistles, music, laughter, etc. Nothing they express can be understood
(no words or parts of them like woohoo, wow, oh, goal, etc.), only the sounds.
Therefore, if people are cheering or booing, but you also hear a word or part of it,
mark the file as mixed-other. Remember that the mixture of noise and speech
results in a mixed file.
When adding the tags, please label the sounds you identify in the audio file,
e.g., in the case of group cheering: some clapping, some screaming, some music,
some laughing, etc. In the case of group booing: booing and some screaming,
some music, some laughing, etc.
Tags: children, small group, crowd, adults, booing, some screaming, some music,
some laughing, etc.
18. Noise-indoor non-intelligible noise: this subtype represents a mixture of sounds
you can hear in indoor areas.
It includes sounds from a not-so-busy location like a quiet office, where no
other relevant sounds are involved, or the sounds of a loud place with many
sounds, like a basketball game.
The list of spaces is diverse and broad. We recommend verifying whether the location
corresponds to one of the following tags. If the location you identify is not on
the list, please add a new tag if necessary.
When adding the tag, please label the indoor space and the sounds you
identify in the audio file, e.g., in the case of an office: some keyboard, some printer,
some phone, some babble, etc.
Tags: sports events, cafe, restaurant, pool, school, hospital, office, room, house,
laundry room, living room, garage, church, subway, museum, library, building,
bank, hall, casino, etc.
19. Noise-outdoor non-intelligible noise: this subtype represents a mixture of
sounds you can hear in outdoor spaces.
It includes sounds from a not-so-busy location like a quiet park, where no
other relevant sounds are involved, or the sounds of a loud place with many
sounds, like a noisy playground.
The list of spaces is diverse and broad. We recommend verifying whether the location
corresponds to one of the following tags. If the location you identify is not on
the list, please add a new tag if necessary.
When adding the tags, please label the outdoor space and the sounds you
identify in the audio file, e.g., in the case of a forest: some water, some birds, some
insects, some wind, etc.
Tags: street noise, park, stadium, theme park, playground, parking lot, square,
train station, fair, mountain, desert, beach, church, construction, forest, jungle,
harbor, etc.
20. Noise-mechanisms/tools: this subtype represents all kinds of tool or
mechanism sounds, e.g., drill, hammer, saw, compressor, lawn mower, chains,
ropes, welding machine, or excavator. Add others after verifying that the list does not
include them already.
Tags: dremel, servo motor, hand-drill, hammer, jackhammer, saw, hand saw,
electric saw, electric razor, hair trimmer, shaver, sander, ratchet, electric
screwdriver, compressor, wrenches, screws, shovel, lawn mower, spatula, chain,
rope, welding machine, hydraulic pump, digger, etc.
21. Noise-devices: this subtype includes any sound of medical, digital, or office
devices, e.g., heartbeat or blood pressure monitors, clocks, cameras, printers,
faxes, typewriters, etc.
Tags: defibrillator, blood pressure monitor, nebulizer, heart rate monitor, chemical
analyzer, clock, camera, printer, scanner, fax, photocopier, cash register,
typewriter, amplifier, adding machine.
22. Noise-keyboard/mouse: this subtype includes all computer keyboards and mouse
click sounds.
Tags: keyboard, mouse
23. Noise-fan/air conditioning: this subtype includes all the sounds of a fan or an air
conditioner. These differ from wind sounds: the device is the primary source of the
sound.
Tags: fan, air conditioning
24. Noise-instrumental music: this subtype includes any kind of music (with or without
sound effects), musical notes, and sounds produced by any musical instrument but
without speech or singing. If the audio file contains singing and music, you must
annotate it as mixed-vocal music.
It is not necessary to add a tag. However, if you identify the musical
instrument in the audio, please label it.
Tags: —
25. Noise-generic impact sounds: this subtype represents soft, moderate, or loud
impact sounds produced by objects or actions (not an explosion), e.g., the sound of
a slamming door, one thing hitting another, an object falling to the ground, etc.
Exception: although we have different subtypes to represent objects and
actions, if the sound is an impact, please use this subtype, e.g., a
slamming door (do not use "domestic sounds/home sounds"), glass breaking (do
not use "glass"), loud footsteps (do not use "human (non-speech)"), wood or metal
impacts (do not use "wood" or "metal").
Tags: slamming doors, hits, slap, kick, slam, knock
26. Noise-surface contact: this subtype represents soft, moderate, or loud sounds
someone produces when rubbing, touching, or manipulating an object, fabric, or
surface, e.g., sliding, rolling, or moving an object on a table or other surfaces.
It is essential to highlight that hitting is not considered "surface contact"; this
sound corresponds to the "generic impact sounds" subtype.
Tags: —
27. Noise-explosion: any sound related to an explosion (from any source). The sound
must be significant enough to be labeled as such. This subtype includes gunshots,
bombs, fireworks, etc.
Tags: gunshots, fireworks, bombs
28. Noise-liquid/water: this subtype represents any sound of liquids and water. The
liquid is the primary source of the sound; it must predominate. This subtype
includes ocean, river, lake, beach, shower, faucet, fountain, etc. Please consider
that if you identify more ambient noises and not only liquid or water sounds, you
must use other subtypes, such as "outdoor non-intelligible noise" or "domestic
sounds/home sounds."
Exception: this subtype does not include rain sounds.
Tags: —
29. Noise-weather: this subtype includes wind, rain, thunderstorm, or lightning sounds.
In the case of wind, please add the tags "gusty" or "howling" to identify the
specific sound it produces.
Tags: wind, gusty, howling, thunderstorm, rain
30. Noise-fire: this subtype includes any sound of fire, or flames burning, in small or
large quantities.
Tags: —
31. Noise-glass: any sound related to glass, ceramics, and crystals. It can be the
sound of cutlery tapping on glass, the sound of glass rubbing, clinking glasses, etc.
Exception: if the sound represents an impact, please use the subtype
"generic impact sounds", e.g., a glass object impacts another or hits the ground.
Tags: —
32. Noise-metal: this subtype represents any sound of metal objects. We exclude
metal doors since we include them in the "domestic sounds/home sounds" subtype.
Exception: if the sound represents an impact, please use the subtype
"generic impact sounds", e.g., a metal object impacts another or hits the ground.
Tags: —
33. Noise-paper: this subtype represents any sound of paper. We exclude food
packaging sounds since we include them in the "food packaging" ST.
Tags: —
34. Noise-wood: any sound related to wood. It can be cracking or creaking.
Exception: if the sound represents an impact, please use the subtype
"generic impact sounds", e.g., a wooden object impacts another or hits the ground.
Tags: —
35. Noise-silence: the audio file does not contain sound. You play back the file, but
there is no content.
It is worth mentioning that this ST is as important as any other one in the
ontology. You will also find audio files that contain silence together with
other sounds. You must always identify whether the silence segment
predominates in the audio file (whether you find it at the beginning, middle, or end)
and, if it does, annotate the file using this ST.
Some examples would be:
● An audio file containing 5 seconds of silence and 2 seconds of a baby crying
sound. The silence predominates. You must annotate it as
noise-silence-some baby crying.
● An audio file containing 2 seconds of silence and 1 second of a
door-slamming sound. The silence predominates. You must annotate the file
as noise-silence-some door.
Tags: —
36. Noise-other: you will use it to label a noise file containing one of the following
cases.
a. A sound that you cannot include in the previous options, e.g., sound effects, gas
(gas leak sounds), spray (aerosols), electricity (electricity sound effects, short
circuits, etc.), glitch (errors, distortion, interference, etc.), zippers, toys, etc.
Some examples:
● Noise-other-sci-fi: this tag refers to science fiction (sci-fi) sounds. It
includes sound effects such as laser sound effects, spaceships, space guns,
etc.
● Noise-other-creatures: it refers to synthesized sounds of monsters or
creatures such as dragons, zombies, beasts, etc.
This tag does not include humans imitating these sounds. Remember
the "unintelligible vocal sounds" subtype. It represents vocal sounds produced by
humans that are not considered speech, such as imitating animals, ghosts,
creatures, and cartoons.
● Noise-other-comic: this tag refers to any synthesized sound of
cartoons.
● Noise-other-ghosts: it refers to the synthesized sounds of ghosts.
● Noise-other-sound effects: this tag refers to random or unidentified
sounds you cannot include in the previous options.
b. The mixture of two different noise subtypes where neither predominates. A
few examples would be:
● "eating/digestive" and "food packaging": the audio file contains a person
eating (eating/digestive) and manipulating a plastic bag (food packaging) at
the same time.
● "bell" and "livestock/farm animals": the audio file contains a cow producing
sounds (livestock/farm animals) and a cowbell ringing (bell) at the same
time.
Tags: explain
Example: noise-other-eating, food packaging
noise-other-cow, bell
Mixed
______________________________________________________________________
Mixed: audio files containing a mixture of noise and speech.
1. Mixed-vocal music: this subtype represents audio files that contain someone
singing and music (electronic or instrumental) in the background.
Tags: —
Special cases:
● Echo effects: You can clearly hear complete words and sounds repeated several
times. We mostly find this type as a strong and distinct echo effect in music with
vocals (singing and music). However, we do not label these files as mixed-vocal
music. We mark them as mixed-other because the echo effect is considered
distortion or noise. Select high with the reverberation level button.
● Humming: refers to producing a continuous sound without opening the mouth. It
is commonly used as a warm-up when singing. It represents noise; you must
annotate it as human non-speech.
Some people confuse humming with non-lexical singing or vocalization.
Remember that non-lexical singing refers to singing nonsense syllables such as
"la la la," "na na na," or "da da da." These nonsense syllables represent singing and not
humming. As a result, you must annotate audio files containing singing or humming in
the following way:
○ Speech-acapella singing: singing without music
○ Mixed-vocal music: singing and music
○ Mixed-other: singing and humming
○ Noise-instrumental music-some humming: humming and music
● Beatboxing: the vocal imitation of musical
instruments. We also consider it mixed-vocal music.
● Choirs or more than one person singing:
○ Mixed-other: the file does not contain music.
○ Mixed-vocal music: the file contains music and an ensemble of singers.
2. Mixed-other: the audio file contains speech (words or parts of them, regardless of the
language) and noise (any sound) or music (without singing).
Please notice that words such as "uh," "oh," "wow," "ah," "eh," "hey," etc. are
considered speech.
Additionally, this ST includes overlapping speech/voices, that is,
otherwise clean speech files in which more than one person speaks at the same time.
In cases where the audio file is mixed but has a continuous segment of at
least 4 seconds of clean speech, add the tags "male speech," "female speech,"
"children speech," or "acapella singing" as appropriate. This also applies to cases
where most of the file consists of silence (treated as noise) and there is a
continuous 4-second segment of clean speech. We will cut the files labeled with
these tags and utilize the speech segment. If there is no such continuous speech
segment, do not add a tag and proceed to the next file.
Some examples:
● 6 seconds noise + 4 seconds male speech = mixed-other-male speech
● 8 seconds noise + 4 seconds female speech = mixed-other-female speech
● 5 seconds female speech + 7 seconds noise = mixed-other-female speech
● 6 seconds silence + 6 seconds male speech = mixed-other-male speech
● 5 seconds acapella singing + 7 seconds silence = mixed-other-acapella
singing
● 4 seconds noise + 4 seconds silence + 4 seconds acapella singing =
mixed-other-acapella singing
Tags: —
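The 4-second rule for mixed-other files can be sketched as follows. The function name, the (kind, seconds) representation of a file, and the choice to use the first qualifying segment are assumptions made for illustration; this material does not say how to handle a file with more than one qualifying segment.

```python
# Tags allowed for a continuous clean segment, as listed in the rule above.
SPEECH_TAGS = ("male speech", "female speech", "children speech", "acapella singing")
MIN_CLEAN_SECONDS = 4.0  # minimum length of the continuous clean segment


def mixed_other_label(segments):
    """Return the mixed-other label for a file.

    `segments` is a hypothetical list of (kind, seconds) pairs, one pair per
    continuous stretch of audio, in playback order. `kind` is "noise",
    "silence", or one of SPEECH_TAGS.
    """
    for kind, seconds in segments:
        if kind in SPEECH_TAGS and seconds >= MIN_CLEAN_SECONDS:
            return f"mixed-other-{kind}"
    # No continuous clean segment of at least 4 seconds: no tag is added.
    return "mixed-other"


print(mixed_other_label([("noise", 6), ("male speech", 4)]))
print(mixed_other_label([("noise", 6), ("female speech", 3)]))
```

Silence counts as noise here, so a file made of silence followed by six seconds of clean male speech is still labeled mixed-other-male speech.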
"Some X" Tags
______________________________________________________________________
We use the "some X" tags to identify all the sounds that do not predominate in an
audio file. You cannot ignore them as they are also relevant to our purpose. As stated
before, you will find audio files with more than one sound.
To annotate the audio files correctly, you must identify the sound that
predominates and represent it with a specific subtype while using the "some X" tags to
mark the less perceptible ones. This rule applies even if all the sounds belong to the
same ST.
Please check the following examples and review how you must annotate them:
● 80% dog + 20% cat = noise (MT)-domestic animals/pets (ST)-dog, some cat
(tags)
● 80% dog + 20% birds = noise (MT)-domestic animals/pets (ST)-dog, some
birds (tags)
● 80% dog + 10% bell + 10% vacuum = noise (MT)-domestic animals/pets
(ST)-dog, some bell, some vacuum (tags)
● 80% dog + 20% insects = noise (MT)-domestic animals/pets (ST)-dog, some
insects (tags)
● 80% blender + 20% dog = noise (MT)-domestic sounds/home sounds
(ST)-kitchen, blender, some dog (tags)
● 20% car + 80% cow = noise (MT)-livestock/farm animals (ST)-cow, some car
(tags)
● 80% silence + 20% clapping = noise (MT)-silence (ST)-some clapping (tags)
● 50% birds + 50% bell = noise (MT)-other (ST)-birds, bell (tags)
● 50% dog + 50% birds = noise (MT)-other (ST)-dog, birds (tags)
● 100% dog + birds = noise (MT)-other (ST)-dog, birds (tags)
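The predominance rule illustrated by these examples can be sketched in a few lines. This is an assumption-laden illustration, not a tool we provide: the function name, the (tag, subtype, percent) input, and the handling of ties among more than two sounds are all hypothetical. An empty tag stands for a subtype that needs no tag of its own, such as silence.

```python
def annotate_noise(sounds):
    """Build a noise label from (tag, subtype, percent) tuples.

    The sound with the largest share sets the subtype; every less
    prominent sound becomes a "some X" tag. When two or more sounds
    share the top position, the subtype is "other" and their tags are
    listed without "some".
    """
    # sorted() is stable, so sounds with equal shares keep their input order.
    ranked = sorted(sounds, key=lambda s: s[2], reverse=True)
    top = ranked[0][2]
    leaders = [s for s in ranked if s[2] == top]
    rest = [s for s in ranked if s[2] < top]

    if len(leaders) == 1:
        tag, subtype, _ = leaders[0]
        tags = [tag] if tag else []
    else:
        subtype = "other"
        tags = [tag for tag, _, _ in leaders if tag]

    tags += [f"some {tag}" for tag, _, _ in rest if tag]
    return "noise-" + subtype + ("-" + ", ".join(tags) if tags else "")


print(annotate_noise([("dog", "domestic animals/pets", 80),
                      ("cat", "domestic animals/pets", 20)]))
print(annotate_noise([("birds", "wild animals", 50), ("bell", "bell", 50)]))
```

For a subtype that takes a compound tag, such as the blender example, the caller would pass the full tag text ("kitchen, blender") as the tag of the predominant sound.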
In cases where the ST or tag represents a mixture of sounds, e.g., kitchen, group
cheering, outdoor non-intelligible noise, and indoor non-intelligible noise, please add the
"some X" tags to identify the main sounds you hear in the audio file. Please review the
following examples:
● 100% group cheering (from start to end) + 10, 30, or 50% clapping (only in one or
more parts) = noise (MT)-group cheering (ST)-some clapping (tags)
● 100% forest (from start to end) + 30% water + 5% birds (only in one or more
parts) = noise (MT)-outdoor non-intelligible noise (ST)-forest, some water,
some birds (tags)
● 100% street noise (from start to end) + 20% car + 5% horn + 60% wind (only in
one or more parts) = noise (MT)-outdoor non-intelligible noise (ST)-street
noise, some car, some horn, some wind (tags)
Specialty Sounds
______________________________________________________________________
There is a list of specific sounds that we pay special attention to. We call them
"specialties." We focus on these as they are frequently present during video
conferences and audio calls. The goal is to improve the audio quality and the user
experience when communicating. When adding the "some X" tags, please use the
following table with the specialty sounds as a reference:
Specialty Sound            Tag "some X"
1. babble                  some babble
2. baby crying             some baby crying
3. bell                    some bell
4. bird                    some birds
5. car                     some car, some airplane, some boat, some motorcycle
6. clapping                some clapping
7. dog barking             some dog, some dog barking
8. door closing            some door
9. eating                  some eating
10. fan/AC                 some fan, some air conditioning
11. food packaging         some food packaging
12. horn and siren         some horn, some siren, some alarm
13. keyboard striking      some keyboard
14. kids screaming         some kids screaming
15. kitchen sounds         some kitchen, some dishwasher, some blender
16. low rev engines        some low rev
17. musical instruments    some music
18. running water          some water
19. vacuum cleaner         some vacuum
20. wind noise             some wind
Reverberation
______________________________________________________________________
Reverberation is the prolongation or persistence of a sound after its production.
The fundamental difference from echo (repetition of the exact sound one or
more times) is that the sound does not repeat itself. Reverberation is naturally
present in some rooms, such as halls, empty rooms/spaces, auditoriums, churches,
corridors, etc.
To learn more about this topic, please watch these videos:
Reverb VS Delay or Echo - What Is The Difference? Explanation, Comparis…

Echoes and reverberations
Echo vs Reverb: What's the difference
We have 3 levels of reverberation: low, med, and high. Use the reverberation
level buttons (in the application) to label the audio files as follows.
1. Low: a non-existent or almost imperceptible reverberation.
2. Med: a standard level, a perceptible reverb that sounds natural.
3. High: a strong reverberation that still sounds natural.
Special cases:
● Speech-strong reverb: a clean speech audio file (a conversation, reading, or
singing without sounds) with a very strong reverberation. The reverb is real and
natural (not fake, and not an echo where you can hear a distinct repetition of words or
sounds one or more times). It includes only one speaker at a time. Mark these
files as high.
Tags: male, female, children
● Echo effects: You can clearly hear complete words and sounds repeated several
times. We mostly find this type as a strong and distinct echo effect in music with
vocals (singing and music). However, we do not label these files as mixed-vocal
music. We mark them as mixed-other because the echo effect is considered
distortion or noise. Select high with the reverberation level button.
Other important aspects to consider:
● Regardless of the case, it is necessary to use lowercase letters only (including
words such as english, spanish, french, italian, etc.).
● Do not use punctuation marks, special characters, or symbols (dash, dot,
comma, exclamation, question, or quotation marks).
● The "copy from above" button is a helpful tool. Do not misuse it. Use it only when
you are sure the file you are listening to needs to be annotated exactly like the
previous one.
● Consider and annotate all the audio files independently, even though they seem
related. You will find short audio files (0, 1, 2 seconds). In many cases, an audio
file gives the impression of being the continuation of the previous one. Label each
file according to its own content, and reuse the previous label only when it truly applies.
● Label all the main sounds you identify in the audio files. Regardless of the
duration or volume, annotate the sounds accordingly, e.g., an audio file
containing wind and bird sounds from beginning to end: the wind sound
predominates, while the bird sounds are less perceptible. It does not mean that
the bird sounds are less important. You must annotate both sounds.
● Remember, you must work from a noise-free environment and concentrate on
the task. It will allow you to recognize all the different sounds you have to
annotate.
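The two formatting rules above (lowercase only, no punctuation or symbols) can be checked before you type a free-text tag. The sketch below is a hypothetical helper, not a feature of the application; replacing each symbol with a space rather than deleting it is an assumption, chosen so that a dashed word stays readable as two words.

```python
import re


def normalize_tag(raw: str) -> str:
    """Lowercase a single free-text tag and strip punctuation and symbols."""
    lowered = raw.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]|_", " ", lowered)
    # Collapse the runs of whitespace left behind by the replacement.
    return " ".join(cleaned.split())


print(normalize_tag("English!"))
print(normalize_tag('Some "Dog Barking".'))
```

The helper is meant for one tag at a time; it would also remove the commas you use to separate several tags, so apply it to each tag before joining them.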
Do not hesitate to ask questions and seek clarification when you have doubts.
We need you to understand this information to perform this task successfully.
Otherwise, we will ask you to re-annotate the files if they do not meet the criteria we
present in this material.
This material comprises all the details we require you to know before annotating
the audio files. The information is extensive. Therefore, we ask you to study it
carefully and review it regularly.
