Collection and Annotation of Colloquial Video Data:
Brazilian Portuguese
Project Manager: Ms. Jingping “Vicky” Wei

weijingping@datatang.com
Datatang Technology
I. Project Overview
The goal of this project is to collect Brazilian Portuguese colloquial video data from public
video websites and to transcribe the audio extracted from the video data. The minimum effective
data duration required for this project is 500 hours.
Effective data duration is the sum of the duration of all valid clips of data. Based on previous
experience, output rate is estimated at 70%.
$$\text{output rate} = \frac{\text{effective data duration}}{\text{total duration of data collected}} \approx 70\%$$
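As a rough planning aid, the output rate definition can be inverted to estimate how much raw video has to be collected to reach the 500-hour minimum. The sketch below is illustrative only; the function name is hypothetical and the 70% figure is the estimate quoted above, not a guarantee.

```python
def required_collection_hours(effective_hours: float, output_rate: float) -> float:
    """Raw hours to collect so that `effective_hours` remain after filtering."""
    if not 0 < output_rate <= 1:
        raise ValueError("output_rate must be in the range (0, 1]")
    return effective_hours / output_rate


# 500 effective hours at an estimated 70% output rate
total = required_collection_hours(500, 0.70)
print(f"{total:.1f} hours")  # prints "714.3 hours"
```

In other words, roughly 715 hours of raw video would be needed if the 70% estimate holds.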
II. Data Collection
A. Target Language
Standard Brazilian Portuguese. Minor regional accents are acceptable.
B. Style of Utterance
In the videos collected from public websites, the speakers should speak naturally rather than
carefully enunciating each word (as in news broadcasts, audiobooks, etc.). Avoid
data from theatrical and cinematic productions, where speech is exaggerated. Also avoid
videos with constant background noise and/or music. Videos with closed captions or subtitles
are preferred.
C. Data Qualification
The following types of data should be removed during the collection process:
1. Data with an obvious style of reading or a strong accent;
2. Data in which the primary language spoken is not Brazilian Portuguese;
3. Data with poor sound quality, constant background music, or a noisy environment;
4. Data where the speaker is not physically present (heard over a phone call, in a recording, etc.).
D. Data Format Requirements
1. Video Data
Video data should be saved as .mp4 files. The files should be named with six-digit incremental
serial numbers (i.e., 000001, 000002, 000003, etc.) with the proper file extension
(“.mp4” in this case). Please document all relevant original information, including the video
URL, the original duration, the content category, and the name of the program. Examples of
appropriate categories include but are not limited to the following:
• interviews
• variety shows
• live streaming
• lectures
• sports
Please discuss any additional categories that might be suitable for the project with the
project manager.
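The six-digit naming scheme can be produced with zero-padded formatting. This is a minimal sketch; the helper name is hypothetical, and starting the sequence at 1 is inferred from the examples above.

```python
def serial_name(index: int, extension: str) -> str:
    """Build a file name such as 000001.mp4 from a serial number."""
    if not 1 <= index <= 999_999:
        raise ValueError("serial number must fit in six digits")
    # :06d pads the number with leading zeros to six digits
    return f"{index:06d}.{extension.lstrip('.')}"


print(serial_name(1, "mp4"))    # 000001.mp4
print(serial_name(42, ".wav"))  # 000042.wav
```

The same helper can name the matching audio, transcription, and metadata files by changing only the extension.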
2. Audio Data
Audio data should be extracted from the video files and saved as 16 kHz, 16-bit, mono .wav
files.
The name of each audio file should be consistent with the name of the corresponding video
file (see Section II.D.1.) but with the proper file extension (“.wav” in this case).
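One possible way to perform the extraction is with the ffmpeg command-line tool. This document does not prescribe a tool, so ffmpeg is an assumption here, as are the function names. The sketch builds the command separately so it can be inspected before anything is run.

```python
import subprocess
from pathlib import Path


def ffmpeg_extract_command(video: Path) -> list[str]:
    """Build an ffmpeg command that writes a 16 kHz, 16-bit, mono .wav file."""
    wav = video.with_suffix(".wav")  # same base name, .wav extension
    return [
        "ffmpeg", "-i", str(video),
        "-vn",                # ignore the video stream
        "-ac", "1",           # one channel (mono)
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit signed little-endian PCM
        str(wav),
    ]


def extract_audio(video: Path) -> None:
    """Run the extraction; requires ffmpeg to be installed and on PATH."""
    subprocess.run(ffmpeg_extract_command(video), check=True)
```

Calling `extract_audio(Path("000001.mp4"))` would write `000001.wav` next to the video, assuming ffmpeg is available.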
3. Transcription Data
The transcribed content should be stored in a .txt file, with each line containing the information
for one clip. Each line should include:
1. time stamp of the clip (start time and end time);
2. the Speaker ID tag (see Section IV.C.);
3. the text transcription.
The pieces of information in each line should be separated by a tab.
The name of each .txt file should be consistent with the name of the corresponding video
file (see Section II.D.1.) but with the proper file extension (“.txt” in this case).
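A line in this layout could be read as sketched below. The exact encoding of the time stamp is not specified in this document, so the sketch assumes four tab-separated fields (start time in seconds, end time in seconds, Speaker ID tag, text); the class name, function name, and the "S1" tag are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClipLine:
    start: float      # assumed: seconds from the start of the audio file
    end: float        # assumed: seconds from the start of the audio file
    speaker_id: str
    text: str


def parse_clip_line(line: str) -> ClipLine:
    """Parse one tab-separated transcription line (assumed 4-field layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    start, end = float(fields[0]), float(fields[1])
    if start < 0 or end <= start:
        raise ValueError("time stamps must be non-negative and ordered")
    return ClipLine(start, end, fields[2], fields[3])


clip = parse_clip_line("1.20\t6.85\tS1\tbom dia, tudo bem?\n")
print(clip.speaker_id, clip.text)  # S1 bom dia, tudo bem?
```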
4. Metadata
There should be a .metadata file (version V1.2) that documents the following information
for each video file:
1. the Speaker ID tag and the gender of each speaker in the video (see Section IV.C.);
2. the category of the video (see Section II.D.1.).
The name of each .metadata file should be consistent with the name of the corresponding
video file (see Section II.D.1.) but with the proper file extension (“.metadata” in this case).
III. Data Organization

Data should be organized in the structure shown below:
IV. Data Annotation
A. Phrase Clipping
Annotators should use the segmentation tool in Shujiajia to mark and clip spoken phrases
in each audio file.
Ideally, each clip should include the voice of only a single speaker, and voices of different
speakers should not be clipped into the same clip.
Each clip should not exceed 8 seconds, but it should not be too short, either. Based on
previous experience, the average duration of a clip is usually 5 to 6 seconds. If the duration
of a clause exceeds 8 seconds, the clause may be divided into smaller clauses. If a phrase
consists of merely a single word, it is not necessary to clip it as an individual phrase. If it
can be merged with the preceding or succeeding phrase without breaking semantic coherence, then
merge the two phrases; if not, the word may be disregarded and excluded from the clip.
Annotators should consider semantic coherence during the process; i.e., the smallest unit
of any clip should be a complete clause. However, if a phrase contains a pause that exceeds
2 seconds, the phrase must be divided into two separate clips
regardless of semantic coherence. If the pause is shorter than 2 seconds, the phrase should be
captured in one clip as long as the total duration does not exceed 8 seconds.
It would be ideal to include a 0.2-0.3 second buffer zone at each end of the clip, though this
is not required. A clip must not begin with sudden noise, and the buffer zone before and after
the phrase may be shortened to avoid sudden noise. No speech segment may be cut out of the
clip — i.e., the speech in each clip must be intact.
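The timing rules above (split at pauses longer than 2 seconds, keep clips within 8 seconds) can be sketched as a simple grouping pass. This is only an illustration of the timing constraints: it assumes the input is a time-ordered list of (start, end) speech segments for one speaker, it treats a pause of exactly 2 seconds as mergeable, and it cannot judge semantic coherence, which remains the annotator's task. The names are hypothetical.

```python
MAX_CLIP_SECONDS = 8.0
MAX_PAUSE_SECONDS = 2.0


def group_into_clips(segments: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Merge adjacent speech segments into clips under the timing rules."""
    clips: list[tuple[float, float]] = []
    for start, end in segments:
        if clips:
            clip_start, clip_end = clips[-1]
            pause = start - clip_end
            # Merge only if the pause is short and the clip stays within 8 s
            if pause <= MAX_PAUSE_SECONDS and end - clip_start <= MAX_CLIP_SECONDS:
                clips[-1] = (clip_start, end)
                continue
        clips.append((start, end))
    return clips


segments = [(0.0, 2.5), (3.0, 5.0), (7.5, 9.0), (9.4, 12.0), (12.3, 16.5)]
print(group_into_clips(segments))
# [(0.0, 5.0), (7.5, 12.0), (12.3, 16.5)]
```

Here the 2.5-second pause after 5.0 forces a split, and the last segment starts a new clip because merging it would push the clip to 9 seconds. A single segment longer than 8 seconds is left as-is by this sketch and would still need to be divided by hand.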
B. Clip Validity
After clipping, the validity of each clip should be evaluated. A clip is deemed invalid under
the following circumstances:
1. If the voices of two speakers overlap for the most part and the volumes of their speech are
similar;
• If there is only a minor overlap (only one or two words), and the speech of the
main speaker can be heard clearly, the clip is deemed valid.
2. If part of the phrase is inaudible and/or the content is uninterpretable;
3. If there is strong noise (environmental noise, equipment noise) that obscures the speech
of the main speaker;
4. If the audio data for a phrase is distorted, discontinuous, and/or otherwise damaged;
5. If the speaker is not human (e.g., virtual assistant, synthetic voice, etc.);
6. If a phrase is primarily spoken in a language other than Brazilian Portuguese;
7. If a phrase involves sensitive information (politically sensitive information, religiously
sensitive information, personally identifiable information such as ID number and street
address, pornographic content, etc.).
C. Speaker Identification
For each valid clip, mark its main speaker with a Speaker ID tag and specify their gender.
For any recording, the ID assignment for the speakers must stay consistent throughout the
recording.
D. Audio-to-Text Transcription
The annotators should transcribe the content as it is in the audio. Ensure that the transcribed
content exactly matches the audio file. There should not be any additional, absent, or incorrect
words. The general guidelines are as follows.
1. Numbers
Numbers need to be transcribed as the corresponding words based on how they are said
in the audio, not as Arabic numerals.
2. Alphabetical Letters
If a speaker says a series of letters, then in the transcription, the letters should be capitalized
and separated by a space.
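As a small illustration of the letter rule, a spelled-out sequence can be normalized as follows. The helper name is hypothetical, and the input "usb" is only an example.

```python
def spell_letters(letters: str) -> str:
    """Capitalize each letter and separate the letters with single spaces."""
    return " ".join(ch.upper() for ch in letters if ch.isalpha())


print(spell_letters("usb"))  # U S B
```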
3. Interjections
Interjections should be transcribed accurately based on semantics and the actual
pronunciation.
4. Punctuations
Only the comma, period, hyphen, apostrophe, question mark, and exclamation mark are allowed
to appear in the transcription. Punctuations other than these six are not allowed in the
transcription. The transcription of punctuations must conform to grammatical rules. The punctuations
need to be typed in the standard Brazilian Portuguese input mode.
Punctuations spoken by the speaker need to be transcribed as words. For example, “@”
should be transcribed as “at” and “.com” should be transcribed as “dot com” or “dot C O M”
depending on the actual pronunciation.
5. Others
• Swear words should be transcribed faithfully and not be replaced;

• Internet buzzwords and common internet words should be transcribed based on their
common usage;
• Repeated words should all be transcribed;
• Words whose semantic meaning is unclear but pronunciation can be confirmed may be
transcribed based on homophones;
• If there is a pause in a word, add a hyphen and a space (i.e., “- ”) after the word, but the
end of the sentence must be a complete word. If the unfinished word is at the end of the
sentence, it should be excluded from the clip.
– e.g. “I want to go to s- school.”
If an exceptional case occurs or whenever disambiguation is desired, please do not hesitate
to contact the project manager for clarification on the transcription guidelines.
V. Data Acceptance
Before the submitted dataset is accepted, native speakers of Brazilian Portuguese will
participate in the final inspection stage. The same batch of data should be checked and accepted
by at least two native speakers. The acceptance results will then be integrated to determine the
overall transcription accuracy.
The overall transcription accuracy should be 98% or higher for words and 100% for
punctuations.
$$\text{overall transcription accuracy} = \left(1 - \frac{\text{amount of incorrectly transcribed words}}{\text{amount of all transcribed words}}\right) \times 100\%$$
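The accuracy formula translates directly into code. The sketch below is illustrative; the function name and the word counts in the example are made up, and only the 98% word threshold comes from this document.

```python
WORD_ACCURACY_THRESHOLD = 98.0  # percent, per the acceptance criterion above


def transcription_accuracy(incorrect_words: int, total_words: int) -> float:
    """Overall transcription accuracy as a percentage."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= incorrect_words <= total_words:
        raise ValueError("incorrect_words must be between 0 and total_words")
    return (1 - incorrect_words / total_words) * 100


# Hypothetical batch: 150 wrong words out of 10,000 gives about 98.5%
print(transcription_accuracy(150, 10_000) >= WORD_ACCURACY_THRESHOLD)  # True
# 250 wrong words out of 10,000 gives about 97.5%
print(transcription_accuracy(250, 10_000) >= WORD_ACCURACY_THRESHOLD)  # False
```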
Before submitting the dataset for quality inspection, please first perform a self-examination
to eliminate the following errors. If such issues occur in the final submission, the entire
submission is deemed unsatisfactory and will be rejected.
1. If the length of an audio file does not match the length of its corresponding video file;
2. If there are unacceptable characters in the text transcription. Only the following
characters are deemed acceptable in this project:
• letters in the Brazilian Portuguese alphabet
• space character
• allowed punctuations (see Section IV.D.4.)
3. If there are multiple consecutive space characters;
4. If there is any space character at the beginning and/or the end of any clause transcription;
5. If there are Arabic numerals in the annotation;
6. If the corresponding .metadata file or .txt file is absent for any audio file;
7. If the time stamp contains negative numbers in any .txt file;
8. If there are erroneous line wraps in any .metadata file;
9. If the audio file is smaller than 15 kilobytes;
10. If the final submission does not follow the required organization structure (see
Section III.).
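Several of the text-level checks in this list (items 2 to 5) can be automated before submission. The sketch below is one possible self-check, not an official validator. The set of accented letters is an assumption about what counts as the Brazilian Portuguese alphabet, the straight apostrophe is assumed to be the allowed apostrophe, and the function name is hypothetical.

```python
import re

# Assumed letter set: basic Latin letters plus common Brazilian Portuguese accents
PT_BR_LETTERS = "a-zA-ZáàâãéêíóôõúçÁÀÂÃÉÊÍÓÔÕÚÇ"
# Letters, space, and the six allowed punctuation marks
ALLOWED_TEXT = re.compile(rf"[{PT_BR_LETTERS} ,.\-'?!]*")


def transcription_errors(text: str) -> list[str]:
    """Return the self-examination problems found in one clip transcription."""
    errors = []
    if any(ch.isdigit() for ch in text):
        errors.append("Arabic numerals")
    if not ALLOWED_TEXT.fullmatch(text):
        errors.append("unacceptable characters")
    if "  " in text:
        errors.append("consecutive spaces")
    if text != text.strip(" "):
        errors.append("leading or trailing space")
    return errors


print(transcription_errors("Tenho dois irmãos."))  # []
print(transcription_errors("Tenho  2 irmãos "))
# ['Arabic numerals', 'unacceptable characters', 'consecutive spaces', 'leading or trailing space']
```

A digit is reported twice, once as a numeral and once as an unacceptable character, since it violates both rules. The file-level checks (matching durations, missing files, file size, directory structure) are not covered by this sketch.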
