
Collection and Annotation of Colloquial Video Data:

Brazilian Portuguese

Project Manager: Ms. Jingping “Vicky” Wei


weijingping@datatang.com

Datatang Technology
Brazilian Portuguese Datatang

I. Project Overview
The goal of this project is to collect Brazilian Portuguese colloquial video data from public
video websites and to transcribe the audio extracted from the video data. The minimum effective
data duration required for this project is 500 hours.
Effective data duration is the sum of the durations of all valid clips. Based on previous
experience, the output rate is estimated at 70%.

\[
\text{output rate} = \frac{\text{effective data duration}}{\text{total duration of data collected}} \approx 70\%
\]
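
As a rough planning aid, the output rate can be inverted to estimate how much raw video must be collected. The sketch below assumes the 70% estimate holds for this project; the function name is hypothetical.

```python
def required_collection_hours(effective_hours: float, output_rate: float) -> float:
    """Estimate the raw hours to collect so the expected effective duration is met."""
    if not 0 < output_rate <= 1:
        raise ValueError("output_rate must be in the interval (0, 1]")
    return effective_hours / output_rate


# 500 effective hours at an estimated 70% output rate
print(round(required_collection_hours(500, 0.70), 1))  # 714.3
```

Under these assumptions, roughly 715 hours of raw video would be needed to yield 500 effective hours.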

II. Data Collection

A. Target Language

Standard Brazilian Portuguese. Minor regional accents are acceptable.

B. Style of Utterance

In the videos collected from public websites, the speakers should speak naturally instead of
carefully enunciating each word (such as the style of news broadcast, audiobooks, etc.). Avoid
data from theaters and cinematographic productions where speech is exaggerated. Also, avoid
videos with perpetual background noise and/or music. Videos with closed captions or subtitles
are preferred.

C. Data Qualification

The following types of data should be removed during the collection process:
1. Data with an obvious style of reading or a strong accent;
2. Data in which the primary language spoken is not Brazilian Portuguese;
3. Data with poor sound quality, perpetual background music or noisy environment;
4. Data where the speaker is not physically present (from a phone call, in a recording, etc.).


D. Data Format Requirements

1. Video Data

Video data should be saved as .mp4 files. The files should be named with six-digit
incremental serial numbers (i.e., 000001, 000002, 000003, etc.) with the proper file extension
(“.mp4” in this case). Please document all relevant original information, including the video
URL, the original duration, the content category, and the name of the program. Examples of
appropriate categories include but are not limited to the following:
• interviews
• variety shows
• live streaming
• lectures
• sports
Please consult the project manager about any additional categories that might be suitable
for the project.
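
The naming rule can be illustrated with a minimal sketch. The function name is hypothetical; only the six-digit, zero-padded pattern comes from the requirement above.

```python
def serial_name(index: int, extension: str = ".mp4") -> str:
    """Return a six-digit, zero-padded file name such as 000001.mp4."""
    if not 1 <= index <= 999_999:
        raise ValueError("index must fit in six digits and start at 1")
    return f"{index:06d}{extension}"


print(serial_name(1))           # 000001.mp4
print(serial_name(42, ".wav"))  # 000042.wav
```

The same helper can produce the matching .wav, .txt, and .metadata names by changing the extension.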

2. Audio Data

Audio data should be extracted from the video files and saved as 16 kHz, 16-bit, mono .wav
files.
The name of each audio file should be consistent with the name of the corresponding video
file (see Section II.D.1.) but with the proper file extension (“.wav” in this case).
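
One possible way to perform the extraction is with the ffmpeg command-line tool; this document does not prescribe a tool, so ffmpeg is an assumption here. The sketch builds the command and checks the resulting file with Python's standard wave module. Both function names are hypothetical.

```python
import wave
from pathlib import Path


def ffmpeg_extract_command(video: Path) -> list[str]:
    """Build an ffmpeg command that writes 16 kHz, 16-bit, mono PCM audio."""
    audio = video.with_suffix(".wav")  # same base name as the video file
    return [
        "ffmpeg", "-i", str(video),
        "-vn",                 # drop the video stream
        "-ac", "1",            # mono
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit signed little-endian PCM
        str(audio),
    ]


def is_valid_wav(path: Path) -> bool:
    """Check that a .wav file is 16 kHz, 16-bit (2 bytes per sample), mono."""
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getsampwidth() == 2
            and wav.getnchannels() == 1
        )


print(" ".join(ffmpeg_extract_command(Path("000001.mp4"))))
```

The command list can be passed to subprocess.run on a machine where ffmpeg is installed.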

3. Transcription Data

The transcribed content should be stored in a .txt file, with each line containing the
information for one clip. Each line should include:
1. time stamp of the clip (start time and end time);
2. the Speaker ID tag (see Section IV.C.);
3. the text transcription.
The pieces of information in each line should be separated by a tab.
The name of each .txt file should be consistent with the name of the corresponding video
file (see Section II.D.1.) but with the proper file extension (“.txt” in this case).
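
The line layout can be illustrated with a small parser. The exact time stamp notation is not specified in this document, so the sketch assumes a hypothetical "[start,end]" form in seconds, and "S1" is a hypothetical Speaker ID tag; only the three tab-separated fields come from the requirement above.

```python
import re
from typing import NamedTuple


class ClipLine(NamedTuple):
    start: float
    end: float
    speaker_id: str
    text: str


# Assumed time stamp notation: [start,end] in seconds, e.g. [0.500,5.820]
_STAMP = re.compile(r"^\[(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)\]$")


def parse_clip_line(line: str) -> ClipLine:
    """Split one transcription line into time stamp, Speaker ID, and text."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    stamp, speaker_id, text = fields
    match = _STAMP.match(stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    start, end = float(match.group(1)), float(match.group(2))
    if end <= start:
        raise ValueError("end time must be later than start time")
    return ClipLine(start, end, speaker_id, text)


print(parse_clip_line("[0.500,5.820]\tS1\tbom dia, tudo bem?"))
```

Because the pattern only accepts unsigned numbers, a negative time stamp is rejected as unrecognized.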


4. Metadata

There should be a .metadata file (V1.2 version) that documents the following information
for each video file:
1. the Speaker ID tag and the gender of each speaker in the video (see Section IV.C.);
2. the category of the video (see Section II.D.1.).
The name of each .metadata file should be consistent with the name of the corresponding
video file (see Section II.D.1.) but with the proper file extension (“.metadata” in this case).

III. Data Organization


Data should be organized in the structure shown below:

IV. Data Annotation

A. Phrase Clipping

Annotators should use the segmentation tool in Shujiajia to mark and clip spoken phrases
in each audio file.
Ideally, each clip should only include the voice of a single speaker, and voices of different
speakers should not be included in the same clip.


Each clip should not exceed 8 seconds, but it should not be too short, either. Based on
previous experience, the average duration of a clip is usually 5 to 6 seconds. If the duration
of a clause exceeds 8 seconds, the clause may be divided into smaller clauses. If a phrase
consists of merely a single word, it is not necessary to clip it as an individual phrase. If it
can be merged with the preceding or succeeding phrase without breaking semantic coherence,
then merge the two phrases; if not, the word may be disregarded and excluded from the clip.
Annotators should consider semantic coherence during the process; i.e., the smallest unit
of any clip should be a complete clause. However, if a phrase contains a pause that exceeds
2 seconds, the phrase must be divided into two separate clips regardless of semantic
coherence. If the pause does not exceed 2 seconds, the phrase should be captured in one clip
as long as the total duration does not exceed 8 seconds.
Ideally, include a buffer zone of 0.2 to 0.3 seconds at each end of the clip, though this
is not required. A clip must not begin with sudden noise, and the buffer zone before and after
the phrase may be shortened to avoid sudden noise. No speech segment may be cut out of the
clip; i.e., the speech in each clip must be intact.
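
The timing rules above can be sketched as a simple grouping pass over speech segments from one speaker. This is only an illustration of the 8-second and 2-second limits; it does not judge semantic coherence, which still requires an annotator, and it is not part of the Shujiajia tool. The function name and the input format (a list of start and end times in seconds) are assumptions.

```python
MAX_CLIP_SECONDS = 8.0   # a clip should not exceed 8 seconds
MAX_PAUSE_SECONDS = 2.0  # a pause longer than this forces a new clip


def group_segments(segments: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Merge consecutive (start, end) segments into clips under the timing rules."""
    clips: list[tuple[float, float]] = []
    for start, end in segments:
        if clips:
            clip_start, clip_end = clips[-1]
            pause = start - clip_end
            # Merge only if the pause is short and the clip stays within 8 seconds.
            if pause <= MAX_PAUSE_SECONDS and end - clip_start <= MAX_CLIP_SECONDS:
                clips[-1] = (clip_start, end)
                continue
        clips.append((start, end))
    return clips


segments = [(0.0, 2.0), (2.5, 5.0), (7.5, 9.0), (9.5, 16.0)]
print(group_segments(segments))  # [(0.0, 5.0), (7.5, 9.0), (9.5, 16.0)]
```

In this example the second segment is merged into the first, the third starts a new clip because of the 2.5-second pause, and the fourth starts a new clip because merging would exceed 8 seconds.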

B. Clip Validity

After clipping, the validity of each clip should be evaluated. A clip is deemed invalid under
the following circumstances:
1. If the voices of two speakers overlap for most of the clip and the volumes of their speech
are similar;
• If there is only a minor overlap (only one or two words), and the speech of the
main speaker can be heard clearly, the clip is deemed valid.
2. If part of the phrase is inaudible and/or the content is uninterpretable;
3. If there is strong noise (environmental noise, equipment noise) that obscures the speech
of the main speaker;
4. If the audio data for a phrase is distorted, discontinuous, and/or otherwise damaged;
5. If the speaker is not human (e.g., virtual assistant, synthetic voice, etc.);
6. If a phrase is primarily spoken in a language other than Brazilian Portuguese;
7. If a phrase involves sensitive information (politically sensitive information, religiously
sensitive information, personally identifiable information such as ID numbers and street
addresses, pornographic content, etc.).


C. Speaker Identification

For each valid clip, mark its main speaker with a Speaker ID tag and specify their gender.
For any recording, the ID assignment for the speakers must stay consistent throughout the
recording.

D. Audio-to-Text Transcription

The annotators should transcribe the content as it is in the audio. Ensure that the transcribed
content exactly matches the audio file. There should not be any additional, absent, or incorrect
words. The general guidelines are as follows.

1. Numbers

Numbers must be written out as words, exactly as they are said in the audio, and not as
Arabic numerals.

2. Alphabetical Letters

If a speaker says a series of letters, then in the transcription, the letters should be capitalized
and separated by a space.
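
As a small illustration of this rule, the sketch below formats a spelled-out letter sequence. The function name and the sample input are hypothetical.

```python
def spell_out(letters: str) -> str:
    """Format a spoken letter sequence as capital letters separated by spaces."""
    return " ".join(ch.upper() for ch in letters if ch.isalpha())


print(spell_out("cpf"))  # C P F
```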

3. Interjections

Interjections should be transcribed accurately based on semantics and the actual
pronunciation.

4. Punctuation

Only the comma, period, hyphen, apostrophe, question mark, and exclamation mark may appear
in the transcription; no other punctuation marks are allowed. Punctuation must conform to
grammatical rules and must be typed in the standard Brazilian Portuguese input mode.
Punctuation marks and symbols spoken aloud by the speaker must be transcribed as words. For
example, “@” should be transcribed as “at” and “.com” should be transcribed as “dot com” or
“dot C O M” depending on the actual pronunciation.


5. Others

• Swear words should be transcribed faithfully and not be replaced;


• Internet buzzwords and common internet words should be transcribed based on their
common usage;
• Repeated words should all be transcribed;
• Words whose semantic meaning is unclear but pronunciation can be confirmed may be
transcribed based on homophones;
• If the speaker breaks off partway through a word, write the fragment followed by a hyphen
and a space (i.e., “- ”), but the sentence must end with a complete word. If the unfinished
word is at the end of the sentence, it should be excluded from the clip.
– e.g. “I want to go to s- school.”
If an exceptional case occurs or whenever disambiguation is desired, please do not hesitate
to contact the project manager for clarification on the transcription guidelines.

V. Data Acceptance
Before the submitted dataset is accepted, native speakers of Brazilian Portuguese will
participate in the final inspection stage. Each batch of data should be checked and accepted
by at least two native speakers. The acceptance results will then be combined to determine the
overall transcription accuracy.
The overall transcription accuracy should be 98% or higher for words and 100% for
punctuation.

\[
\text{overall transcription accuracy} = \left(1 - \frac{\text{number of incorrectly transcribed words}}{\text{number of all transcribed words}}\right) \times 100\%
\]
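
The accuracy formula can be worked through with a short sketch. The word counts in the example are made up for illustration, and the function name is hypothetical.

```python
def transcription_accuracy(incorrect_words: int, total_words: int) -> float:
    """Overall transcription accuracy as a percentage."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= incorrect_words <= total_words:
        raise ValueError("incorrect_words must be between 0 and total_words")
    return (1 - incorrect_words / total_words) * 100


# Illustrative counts: 150 incorrect words out of 10,000 transcribed words
accuracy = transcription_accuracy(150, 10_000)
print(round(accuracy, 2))   # 98.5
print(accuracy >= 98.0)     # True
```

With these illustrative counts the batch would meet the 98% word-accuracy threshold.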

Before submitting the dataset for quality inspection, please first perform a self-examination
to eliminate the following errors. If such issues occur in the final submission, the entire
submission is deemed unsatisfactory and will be rejected.
1. If the length of an audio file does not match the length of its corresponding video file;
2. If there are unacceptable characters in the text transcription. Only the following
characters are deemed acceptable in this project:
• letters in the Brazilian Portuguese alphabet
• space character
• allowed punctuation marks (see Section IV.D.4.)
3. If there are multiple consecutive space characters;
4. If there is any space character at the beginning and/or the end of any clause transcription;
5. If there are Arabic numerals in the annotation;
6. If corresponding .metadata files and .txt files are absent for any audio file;
7. If the time stamp contains negative numbers in any .txt file;
8. If there are erroneous line wraps in any .metadata file;
9. If the audio file is smaller than 15 kilobytes;
10. If the final submission does not follow the required organization structure (see
Section III.).
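
Several of the checks above lend themselves to an automated self-examination. The following is a minimal sketch, not an official validation tool. It assumes that the acceptable letters are the 26 basic Latin letters plus the accented letters listed in the code, that the apostrophe is the plain ASCII character, and that 15 kilobytes means 15 × 1024 bytes; all names are hypothetical.

```python
from pathlib import Path

# Assumption: basic Latin letters plus the accented letters used in Portuguese.
PT_LETTERS = "abcdefghijklmnopqrstuvwxyzáàâãéêíóôõúç"
ALLOWED_PUNCTUATION = ",.-'?!"
ALLOWED = set(PT_LETTERS) | set(PT_LETTERS.upper()) | set(ALLOWED_PUNCTUATION) | {" "}
MIN_AUDIO_BYTES = 15 * 1024  # assumption: 1 kilobyte = 1024 bytes


def text_problems(text: str) -> list[str]:
    """Return the text-level errors found in one clip transcription."""
    problems = []
    bad = sorted({ch for ch in text if ch not in ALLOWED and not ch.isdigit()})
    if bad:
        problems.append("unacceptable characters: " + "".join(bad))
    if any(ch.isdigit() for ch in text):
        problems.append("Arabic numerals")
    if "  " in text:
        problems.append("consecutive spaces")
    if text != text.strip(" "):
        problems.append("leading or trailing space")
    return problems


def file_problems(wav_path: Path) -> list[str]:
    """Return the file-level errors for one audio file and its companions."""
    problems = []
    if wav_path.stat().st_size < MIN_AUDIO_BYTES:
        problems.append("audio file smaller than 15 kilobytes")
    for suffix in (".txt", ".metadata"):
        if not wav_path.with_suffix(suffix).exists():
            problems.append(f"missing {suffix} file")
    return problems


print(text_problems("Eu tenho vinte anos."))  # []
print(text_problems(" tenho 20  anos"))
```

The second call reports Arabic numerals, consecutive spaces, and a leading space. Checks that need human judgment or the missing directory layout, such as the organization structure in Section III, are left out of this sketch.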
