Ontology 2023
Ontology
The purpose of this task is to annotate audio data with labels and tags that describe
its content. We will use this data to improve speech-processing machine-learning
models so that our software can remove unwanted noise during audio calls and video
conferences. More specifically, our software will help to enhance speech quality and
improve digital communication.
To complete this activity, you will use our ontology to understand how to classify
and label audio files. This ontology will be your principal source of information.
Speech
______________________________________________________________________
Speech: audio files containing clean speech in any language, with one person
speaking at a time. Clean speech refers to audio files without any other sounds,
regardless of their duration and level.
This main type does not include audio files with overlapping speech/voices,
that is, otherwise clean speech files in which more than one speaker talks at a time.
To learn more about this topic, please watch the following video:
Exploring Jazz Vocals and Scat Singing
8. Speech-laughing: a completely clean file (without any kind of sounds). The file
contains natural or fake laughter, exaggerated or not. One person laughing at a
time. Consider that if the file includes laughter and other sounds, it must be
annotated as noise (you will find more information in the noise section).
10. Speech-other: a completely clean file that matches either of these two cases:
● Any other clean speech example you cannot include in the previous options.
● A mixture of two different speech subtypes where neither predominates.
One example would be conversational/animated and whispering. In half of the
audio file, a person speaks spontaneously (conversational/animated), and in
the other half, the person whispers (whispering).
Tags: explain
For example, speech-other-conversational, whispering
● Language: you will find audio files in any language. If you identify it, please add
the tag.
Tags: english, spanish, french, italian, etc.
● Emotions: add the tag to identify the emotion expressed in the audio file.
Tags: angry, fearful, sad, excited, happy, disgusted, bored, mad, spontaneous,
disappointed, enthusiastic, neutral
Noise
______________________________________________________________________
Noise: an audio file containing sounds of any kind, duration, or level. It does not include
intelligible speech (words or part of them) regardless of the language, duration, or level.
Tags: —
3. Noise-crying: this subtype represents humans crying for any reason (natural or
fake). Do not consider this ST in cases where other characters (animals, ghosts,
creatures, cartoons, etc.) produce this sound. Use the ST ¨other¨ if you hear a
ghost, creature, or cartoon crying.
It is essential to highlight that the audio file must not contain speech. The
screams must not include words. If that is the case, you must annotate the file as
mixed-other.
Tags: opening bags, opening bottles, squashing bottles, crushing cans, crushing
food bags
8. Noise-alarm/siren/horn: this subtype represents all kinds of alarm, siren, and horn
sounds. The alarm, siren, or horn sound must predominate in the audio file. If not,
determine if you need to use another subtype such as ¨indoor non-intelligible
noise¨, ¨outdoor non-intelligible noise¨, ¨vehicle¨, etc.
9. Noise-bell: this subtype represents all kinds of bells, ringtones, polyphonic sounds
(cell phones), or chimes. The bell sound must predominate in the audio file. If not,
determine if you need to use another subtype such as indoor or outdoor
non-intelligible noise, ¨domestic sounds/home sounds,¨ etc.
Tags: —
10. Noise-domestic animals/pets: sounds produced by domestic animals such as
dogs and cats. This subtype includes sounds produced with their mouth and limbs.
Do not specify the type of sound the animal produces. The only exception is
barking; please add the tag ¨dog barking¨ to label this specific sound.
This ST does not include birds (rooster, duck, hen, chicken, owl, eagle, crow,
or any other).
12. Noise-wild animals: sounds produced by wild animals such as birds, rats, snakes,
insects, flies, bees, crickets, wasps, toads, monkeys, elephants, lions, tigers, bears,
dolphins, and whales, among others. This subtype includes sounds produced with
their mouth and limbs.
● Type of birds: rooster, duck, hen, chicken, owl, eagle, crow, etc. You do not
need to specify the type of bird in the tag. Label all the types of birds as
¨birds¨ and continue with the following file.
Tags: birds, rat, snake, insects, bee, fly, crickets, elephant, wasp, frog, monkey,
dolphin, lion, tiger, bear, wolf, hyena, elk, sea lion, bat, whale
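The bird rule above can be sketched in code. This is only an illustration: the helper name `normalize_animal_tag` is hypothetical and not part of the annotation application, and the set of bird types contains only the names listed in this ontology (other birds would need to be added by hand).

```python
# Bird types named in this ontology; all of them share the single tag "birds".
# Assumption: the set is limited to the examples listed above ("etc." is not covered).
BIRD_TYPES = {"rooster", "duck", "hen", "chicken", "owl", "eagle", "crow"}


def normalize_animal_tag(animal: str) -> str:
    """Return the tag for an animal name, collapsing listed bird types to "birds"."""
    name = animal.strip().lower()
    return "birds" if name in BIRD_TYPES else name


print(normalize_animal_tag("Owl"))    # birds
print(normalize_animal_tag("whale"))  # whale
```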
To annotate these audio files correctly, please add the tag ¨kitchen¨ and
identify the object you hear, e.g., oven, dishes, knives, frying, chopping, mixer, etc.
This ST includes all kinds of door sounds (including car, metal, sliding doors,
etc.).
14. Noise-vehicle: this subtype represents vehicle sounds, their cabins, or their parts
except their engines, motors, or doors. The sounds of these specific parts
correspond to the subtypes ¨engine¨ (engines and motors) and the ¨domestic
sounds/home sounds¨ (all kinds of doors).
Tags: train cabin, airplane cabin, car cabin, car, airplane, motorcycle, bike, horse
carriage, train, cart, skateboard, stretcher, scooter, tractor, truck, bus, helicopter,
carriage, boat, ambulance
15. Noise-engine: this subtype represents the sound of engines and motors (of any
kind) without other vehicle sounds.
16. Noise-group babble: the sounds a group (two or more people) or crowd produces
when speaking simultaneously, e.g., at a theater, airport, food court, etc. Nothing of
what the speakers say can be understood (no words or parts of them regardless of
the language, duration, or level), only the babble sounds. Therefore, if an audio file
contains babble sounds, but you also hear a word or part of it, mark the file as
mixed-other. Remember that the mixture of noise and speech results in a mixed file.
Tags: —
17. Noise-group cheering: the sounds a group (two or more people) or crowd
produces when cheering (expressing joy or support) or booing (showing
displeasure).
When adding the tags, please label the sounds you identify in the audio file,
e.g., in the case of group cheering: some clapping, some screaming, some music,
some laughing, etc. In the case of group booing: booing and some screaming,
some music, some laughing, etc.
Tags: children, small group, crowd, adults, booing, some screaming, some music,
some laughing, etc.
When adding the tag, please label the indoor space and the sounds you
identify in the audio file, e.g., in the case of an office: some keyboard, some printer,
some phone, some babble, etc.
Tags: sports events, cafe, restaurant, pool, school, hospital, office, room, house,
laundry room, living room, garage, church, subway, museum, library, building,
bank, hall, casino, etc.
The list of spaces is diverse and broad. We recommend that you first check whether
the space corresponds to one of the listed tags. If the location you identify is not
on the list, please add a new tag.
When adding the tags, please label the outdoor space and the sounds you
identify in the audio file, e.g., in the case of a forest: some water, some birds, some
insects, some wind, etc.
Tags: street noise, park, stadium, theme park, playground, parking lot, square,
train station, fair, mountain, desert, beach, church, construction, forest, jungle,
harbor, etc.
Tags: dremel, servo motor, hand-drill, hammer, jackhammer, saw, hand saw,
electric saw, electric razor, hair trimmer, shaver, sander, ratchet, electric
screwdriver, compressor, wrenches, screws, shovel, lawn mower, spatula, chain,
rope, welding machine, hydraulic pump, digger, etc.
21. Noise-devices: this subtype includes any sound of medical, digital, or office
devices, e.g., heartbeat or blood pressure monitors, clocks, cameras, printers,
faxes, typewriters, etc.
Tags: defibrillator, blood pressure monitor, nebulizer, heart rate monitor, chemical
analyzer, clock, camera, printer, scanner, fax, photocopier, cash register,
typewriter, amplifier, adding machine.
22. Noise-keyboard/mouse: this subtype includes all computer keyboards and mouse
click sounds.
23. Noise-fan/air conditioning: this subtype includes all the sounds of a fan or an air
conditioner. It is different from wind sounds: the device must be the primary source
of the sound.
24. Noise-instrumental music: this subtype includes any kind of music (with or without
sound effects), musical notes, and sounds produced by any musical instrument but
without speech or singing. If the audio file contains singing and music, you must
annotate it as mixed-vocal music.
Tags: —
25. Noise-generic impact sounds: this subtype represents soft, moderate, or loud
impact sounds produced by objects or actions (not an explosion), e.g., the sound of
a slamming door, one thing hitting another, an object falling to the ground, etc.
26. Noise-surface contact: this subtype represents soft, moderate, or loud sounds
someone produces when rubbing, touching, or manipulating an object, fabric, or
surface, e.g., sliding, rolling, or moving an object on a table or other surfaces.
Tags: —
27. Noise-explosion: any sound related to an explosion (from any source). The sound
must be significant enough to be labeled as such. This subtype includes gunshots,
bombs, fireworks, etc.
28. Noise-liquid/water: this subtype represents any sound of liquids and water. The
liquid is the primary source of the sound; it must predominate. This subtype
includes ocean, river, lake, beach, shower, faucet, fountain, etc. Please consider
that if you identify more ambient noises and not only liquid or water sounds, you
must use other subtypes, such as ¨outdoor non-intelligible noise¨ or ¨domestic
sounds/home sounds.¨
Tags: —
29. Noise-weather: this subtype includes wind, rain, thunderstorm, or lightning sounds.
In the case of wind, please add the tags ¨gusty¨ or ¨howling¨ to identify the
specific sound it produces.
30. Noise-fire: this subtype includes any sound of fire, or flames burning, in small or
large quantities.
Tags: —
31. Noise-glass: any sound related to glass, ceramics, and crystals. It can be the
sound of cutlery tapping on glass, the sound of glass rubbing, clinking glasses, etc.
Tags: —
32. Noise-metal: this subtype represents any sound of metal objects. We exclude
metal doors since we include them in the ¨domestic sounds/home sounds¨ subtype.
Tags: —
33. Noise-paper: this subtype represents any sound of paper. We exclude food
packaging sounds since we include them in the ¨food packaging¨ ST.
Tags: —
Tags: —
35. Noise-silence: the audio file does not contain sound. You play back the file, but
there is no content.
Tags: —
36. Noise-other: you will use it to label a noise file containing one of the following
cases.
a. A sound that you cannot include in the previous options, e.g., sound effects, gas
(gas leak sounds), spray (aerosols), electricity (electricity sound effects, short
circuits, etc.), glitch (errors, distortion, interference, etc.), zippers, toys, etc.
Some examples:
This tag does not include humans imitating these sounds. Remember
the ¨unintelligible vocal sounds¨ tag. It represents vocal sounds produced by
humans that are not considered speech, such as imitating animals, ghosts,
creatures, and cartoons.
b. A mixture of two different noise subtypes where neither predominates. For
example:
● ¨bell¨ and ¨livestock/farm animals¨: The audio file contains a cow producing
sounds (livestock/farm animals) and a cowbell ringing (bell) at the same
time.
Tags: explain
Example: noise-other-eating, food packaging
noise-other-cow, bell
Mixed
______________________________________________________________________
1. Mixed-vocal music: this subtype represents audio files that contain someone
singing and music (electronic or instrumental) in the background.
Tags: —
Special cases:
● Echo effects: You can clearly hear complete words and sounds repeated several
times. We mostly find this type as a strong and distinct echo effect in music with
vocals (singing and music). However, we do not label these files as mixed-vocal
music. We mark them as mixed-other because the echo effect is considered
distortion or noise. Select ¨high¨ with the reverberation level button.
2. Mixed-other: the audio file contains speech (words or parts of them, regardless of
the language) and noise (any sound) or music (without singing).
Please note that words such as ¨uh,¨ ¨oh,¨ ¨wow,¨ ¨ah,¨ ¨eh,¨ ¨hey,¨ etc. are
considered speech.
Some examples:
Tags: —
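The way the three main types relate to each other can be summarized as a small decision rule. The sketch below is an illustration under stated assumptions, not part of the annotation application: the function name `classify_main_type` and its two boolean inputs are hypothetical, and the annotator still makes both listening judgments. It does not cover overlapping voices, which the speech main type excludes.

```python
def classify_main_type(has_speech: bool, has_other_sound: bool) -> str:
    """Pick the main type (MT) from two listening judgments.

    has_speech: any intelligible word or part of a word, in any language,
                including interjections such as "uh", "oh", or "hey".
    has_other_sound: any noise or music besides the speech.
    """
    if has_speech and has_other_sound:
        # Speech plus any other sound always results in a mixed file.
        return "mixed"
    if has_speech:
        return "speech"
    # No intelligible speech: noise, which also covers the silence subtype.
    return "noise"


print(classify_main_type(has_speech=True, has_other_sound=True))    # mixed
print(classify_main_type(has_speech=True, has_other_sound=False))   # speech
print(classify_main_type(has_speech=False, has_other_sound=True))   # noise
```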
¨Some X¨ Tags
______________________________________________________________________
We use the ¨some X¨ tags to identify all the sounds that do not predominate in an
audio file. You cannot ignore them as they are also relevant to our purpose. As stated
before, you will find audio files with more than one sound.
To annotate the audio files correctly, you must identify the sound that
predominates and represent it with a specific subtype while using the ¨some X¨ tags to
mark the less perceptible ones. This rule applies even if all the sounds belong to the
same ST.
Please check the following examples and review how you must annotate them:
● 80% dog + 20% cat = noise (MT)-domestic animals/pets (ST)-dog, some cat
(tags)
● 20% car + 80% cow = noise (MT)-livestock/farm animals (ST)-cow, some car
(tags)
In cases where the ST or tag represents a mixture of sounds, e.g., kitchen, group
cheering, outdoor non-intelligible noise, and indoor non-intelligible noise, please add the
¨some X¨ tags to identify the main sounds you hear in the audio file. Please review the
following examples:
● 100% group cheering (from start to end) + 10, 30, or 50% clapping (only in one or
more parts) = noise (MT)-group cheering (ST)-some clapping (tags)
● 100% forest (from start to end) + 30% water + 5% birds (only in one or more
parts) = noise (MT)-outdoor non-intelligible noise (ST)-forest, some water,
some birds (tags)
● 100% street noise (from start to end) + 20% car + 5% horn + 60% wind (only in
one or more parts) = noise (MT)-outdoor non-intelligible noise (ST)-street
noise, some car, some horn, some wind (tags)
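The examples above follow one pattern: the predominant sound becomes the plain tag, and every other sound becomes a ¨some X¨ tag. The following is a minimal sketch of that pattern. The function names `split_predominant` and `format_annotation` are hypothetical, the percentages are the annotator's own estimates, and the subtype is still chosen by the annotator.

```python
from typing import Dict, List, Optional, Tuple


def split_predominant(shares: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return the predominant sound and the remaining sounds in input order."""
    if not shares:
        raise ValueError("at least one sound is required")
    top = max(shares.values())
    leaders = [sound for sound, share in shares.items() if share == top]
    if len(leaders) > 1:
        # No single sound predominates; the ontology handles this with "other".
        raise ValueError("no sound predominates; consider the 'other' subtype")
    predominant = leaders[0]
    others = [sound for sound in shares if sound != predominant]
    return predominant, others


def format_annotation(main_type: str, subtype: str,
                      primary_tag: Optional[str],
                      secondary_sounds: List[str]) -> str:
    """Build a label string in the same layout as the examples above."""
    tags = [primary_tag] if primary_tag else []
    tags += [f"some {sound}" for sound in secondary_sounds]
    return f"{main_type} (MT)-{subtype} (ST)-{', '.join(tags)} (tags)"


# 20% car + 80% cow: cow predominates, car becomes a "some X" tag.
main, rest = split_predominant({"car": 20, "cow": 80})
print(format_annotation("noise", "livestock/farm animals", main, rest))
# noise (MT)-livestock/farm animals (ST)-cow, some car (tags)

# Group cheering is itself a mixture, so only "some X" tags are added.
print(format_annotation("noise", "group cheering", None, ["clapping"]))
# noise (MT)-group cheering (ST)-some clapping (tags)
```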
Specialty Sounds
______________________________________________________________________
There is a list of specific sounds that we pay special attention to. We call them
¨specialties.¨ We focus on these as they are frequently present during video
conferences and audio calls. The goal is to improve the audio quality and the user
experience when communicating. When adding the ¨some X¨ tags, please use the
following table with the specialty sounds as a reference:
We have 3 types of reverberation: low, med, and high. Use the reverberation
level buttons (in the application) to label the audio files as follows.
Special cases:
● Echo effects: You can clearly hear complete words and sounds repeated several
times. We mostly find this type as a strong and distinct echo effect in music with
vocals (singing and music). However, we do not label these files as mixed-vocal
music. We mark them as mixed-other because the echo effect is considered
distortion or noise. Select ¨high¨ with the reverberation level button.
Other important aspects to consider:
● The ¨copy from above¨ button is a helpful tool. Do not misuse it. Use it only when
you are sure the file you are listening to needs to be annotated precisely like the
previous one.
● Consider and annotate every audio file independently, even if files seem
related. You will find short audio files (0, 1, or 2 seconds), and in many cases a
file gives the impression of being the continuation of the previous one. Label
each file according to its own content, and repeat the previous annotation only
when it truly applies.
● Label all the main sounds you identify in the audio files, regardless of their
duration or volume. For example, an audio file contains wind and bird sounds
from beginning to end; the wind sound predominates, while the bird sounds are
less perceptible. This does not mean that the bird sounds are less important.
You must annotate both sounds.
Do not hesitate to ask questions and seek clarification when you have doubts.
We need you to understand this information to perform this task successfully.
Otherwise, we will ask you to re-annotate the files if they do not meet the criteria we
present in this material.
This material comprises all the details we require you to know before annotating
the audio files. The information is extensive. Therefore, we ask you to study it
carefully and review it regularly.