
Game Foreign-Language Video Annotation Specification


1. Judge whether the speech is valid
All qualified speech should be annotated sentence by sentence. A sentence
that meets any of the following conditions is judged invalid and does not
need to be annotated:
• The voices of two people overlap at similar volume, and the overlap
covers most of the sentence. If the overlap is short (only one or two
words) and the main speaker can be heard clearly, transcribe the
sentence normally;
• Parts of the sentence cannot be heard clearly and the content cannot
be determined;
• Very loud noise (environmental noise, equipment noise) prevents the
main speaker from being heard clearly;
• The sentence has frame loss;
• The sentence is not a normal human voice (automated customer service,
synthetic voice, TV broadcast sound);
• The sentence contains other languages;
• The recording contains private or sensitive personal information, such
as a detailed address, mobile phone number, ID number, bank card
number, social security number, or passport number.

2. Segment the valid parts

• The tagger cuts the audio sentence by sentence, following semantic
coherence. Sentences that are too long can be split. Each segment
should be no longer than 8 seconds, but not too short either. Based on
annotation experience, segments average 5-6 seconds;
• The best position for each boundary is at the lowest point of the
waveform;
• Speech from different speakers must not be placed in the same
segment;
• The tagger should try to leave 0.2-0.3 seconds of silence on each
side of the speech, but this is not mandatory when no silence of that
length is available. The tagger should try to cut segments that are
free of burst noise. To avoid sudden noise, the tagger may shorten the
silence kept before and after the speech, but must not cut into the
speech itself;
• A single word that expresses a response must also be segmented. If it
can be merged with an adjacent sentence, try to merge it;
• If the speaker pauses and leaves more than 2 seconds of silence,
consider whether the meaning stays complete. If each part is still
complete in meaning after the cut, the sentence can be split; if not,
discuss it with the product manager. If the pause is shorter than 2
seconds and the sentence is no longer than 8 seconds, keep it as one
segment;
• If a speaker pauses for no more than 2 seconds but there is noise
during the pause, and the sentence would be incoherent after cutting,
do not split it;
• If there is waveform clipping, truncation, frame loss, or abnormal
energy, the sentence is invalid.
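
The numeric limits in this section lend themselves to an automated pre-check before manual review. The following is only a minimal sketch and not part of the specification: the `Segment` fields and the `check_segment` function are hypothetical names, all times are assumed to be in seconds, and the judgment calls (semantic completeness, escalation to the product manager) are not modeled.

```python
from dataclasses import dataclass

MAX_SEGMENT_SEC = 8.0      # upper bound on segment length from the rules above
MAX_INNER_PAUSE_SEC = 2.0  # longer pauses call for a split decision


@dataclass
class Segment:
    start: float          # segment start time, in seconds
    end: float            # segment end time, in seconds
    speech_start: float   # where the speech itself begins
    speech_end: float     # where the speech itself ends
    longest_pause: float  # longest silent gap inside the speech
    speaker_ids: tuple    # speakers heard inside the segment


def check_segment(seg: Segment) -> list:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    if seg.end - seg.start > MAX_SEGMENT_SEC:
        warnings.append("longer than 8 seconds")
    if len(set(seg.speaker_ids)) > 1:
        warnings.append("more than one speaker")
    if seg.speech_start < seg.start or seg.speech_end > seg.end:
        warnings.append("speech is cut off")
    if seg.longest_pause > MAX_INNER_PAUSE_SEC:
        warnings.append("pause over 2 seconds: consider splitting")
    return warnings
```

A segment such as `Segment(0.0, 5.5, 0.25, 5.25, 0.4, ("O1",))` passes without warnings, while a 9-second segment with two speakers is flagged on both counts.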

3. Identifier of the speaker


Within the same speech segment, mark different speakers with different
IDs and mark the gender of each speaker.

4. Content transcription
The tagger transcribes the content exactly as heard in the audio. The
transcription must match the speech exactly: no added words, no missing
words, and no wrong words. The general guidelines are as follows:
1) Uppercase and lowercase letters: If a word normally starts with a
capital letter, write it the normal way, for example China, Microsoft.
2) Numbers: Numbers in the speech must not be written as Arabic
numerals; write them out in words in the language being transcribed.
Original Text        Transcription
I’m 15 years old     I’m fifteen years old

3) Words pronounced letter by letter: transcribe normally, with no
spaces between the letters.
Original Text        Transcription
five thirty pm       five thirty PM
FBI                  FBI
NFC                  NFC

4) Abbreviations: Do not write the abbreviated form. Always write out
the full word as pronounced. For example:
Original Text        Transcription
This is Dr.Smith     This is doctor Smith
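
Two of the guidelines above (no Arabic numerals, no abbreviated forms) can be pre-checked mechanically. The sketch below is an illustration rather than a prescribed tool: `find_text_issues` is a hypothetical helper, and the abbreviation table holds only the single example from this section, so a real project would need its own list.

```python
import re

# Illustrative only: this table contains just the example given in the guideline.
ABBREVIATIONS = {"dr.": "doctor"}


def find_text_issues(transcript: str) -> list:
    """Flag digits and known abbreviations that should be written out in words."""
    issues = []
    if re.search(r"\d", transcript):
        issues.append("contains digits")
    lowered = transcript.lower()
    for abbr, full in ABBREVIATIONS.items():
        # \b keeps the abbreviation from matching in the middle of a longer word.
        if re.search(r"\b" + re.escape(abbr), lowered):
            issues.append(f"abbreviation '{abbr}' should be '{full}'")
    return issues
```

Running it on "I'm 15 years old" reports the digits, and "This is Dr.Smith" reports the abbreviation; the corrected transcriptions from the examples above produce no issues.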
5. Punctuation:
• Use punctuation according to grammar rules.
• Spoken punctuation must be written out in words, for example: "@"
is written as "at", ".com" is written as "dot com".
• Only commas (,), periods (.), exclamation marks (!), single quotation
marks (‘), and question marks (?) are allowed in the transcription. The
hyphen (-) may appear only in the middle of a word. No other
punctuation may be used. Punctuation must be grammatical, and all
punctuation marks must be the English forms.
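
The punctuation whitelist can be expressed as a small character check. This is a sketch under stated assumptions: `disallowed_punctuation` is a hypothetical name, straight and curly single quotes are all treated as the permitted single quotation mark, and bracketed special labels or the trailing hyphen on an incomplete word (both described in later sections) would have to be handled separately before this check is applied.

```python
import unicodedata

# Comma, period, exclamation mark, question mark, plus straight and curly
# single quotes (treating all three quote forms as allowed is an assumption).
ALLOWED_MARKS = set(",.!?'\u2018\u2019")


def disallowed_punctuation(text: str) -> list:
    """Return the punctuation characters in `text` that the whitelist rejects."""
    bad = []
    for i, ch in enumerate(text):
        if ch == "-":
            # A hyphen is accepted only between two letters, i.e. inside a word.
            inside = (
                0 < i < len(text) - 1
                and text[i - 1].isalpha()
                and text[i + 1].isalpha()
            )
            if not inside:
                bad.append(ch)
        elif unicodedata.category(ch)[0] in ("P", "S") and ch not in ALLOWED_MARKS:
            # Unicode categories starting with P or S are punctuation and symbols.
            bad.append(ch)
    return bad
```

For example, "Well-known, isn't it?" passes, while a line containing "@" or ";" returns those characters so the tagger can write them out or remove them.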

6. Modal particles:
Transcribe modal particles accurately according to pronunciation and
meaning. Pure laughter does not need to be marked. If a modal particle
contributes to the meaning of the context, it must be marked. For
example: "Would you like to have dinner later?" "um." Here "um" is a
response to the question, so it carries meaning. If a modal particle does
not contribute to the meaning of the context, such as mindless humming,
it does not need to be marked, although marking it is not an error.

7. Other
• Transcribe swear words normally; do not replace them with letters;
• Transcribe Internet slang and common Internet words according to
common usage;
• If words are repeated in the speech, transcribe every repetition;
• If the tagger hears a word clearly and is sure of the pronunciation
but not of the meaning, such as a common name, they may choose a
homophone. The text must match the pronunciation. If the context makes
the meaning clear, the tagger should choose the word that matches both
the pronunciation and the meaning of the sentence;
• If a word is incomplete, add a "-" after it, followed by a space
before the next word, for example: I want to go to s- school. The
sentence must end with a complete word. An unfinished word at the end
of a sentence must not be included in the segment.
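
The incomplete-word convention can also be checked automatically. The helpers below are a hypothetical sketch: they assume an incomplete word is always written as letters followed directly by "-" and then a space, as in the example above.

```python
import re


def truncated_words(transcript: str) -> list:
    """Return incomplete words marked with a trailing hyphen, e.g. 's-'."""
    return re.findall(r"\b\w+-(?=\s)", transcript)


def ends_with_truncated_word(transcript: str) -> bool:
    """True if the last word of the sentence is an incomplete word."""
    # Ignore trailing spaces and sentence-final punctuation before testing.
    return transcript.rstrip(" .,!?").endswith("-")
```

A transcript for which `ends_with_truncated_word` returns `True` breaks the rule that a sentence must end with a complete word, so the segment boundary should be moved.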

8. Special label:
If any of the following situations occur, the tagger needs to add special
labels.
(Labels must be used correctly: avoid missing halves of paired labels,
inconsistent capitalization, and unpaired brackets.)
Each entry lists the label, its description, and an example
transcription, grouped by data validation result and role.

Valid data (Role: O1 or O2…)

• No noise, no special label
  Description: Transcribe the content heard according to the standard
  rules.
  Transcription: Today I went to eat.

• Noise label [N]
  Description: If a sentence contains noise, mark [N] at the end of the
  sentence. The type of noise does not need to be distinguished.
  Transcription: Today I went to eat.[N]

• [HM]
  Description: Mark rapping and singing with [HM] at the end of the
  sentence.
  Transcription: 一人我饮酒醉[HM] (Chinese, roughly "alone I drink until
  I am drunk")

• [OVERLAP/][/OVERLAP]
  Description: If the speech overlaps but one side is particularly
  clear, transcribe only that part, and set the role to that speaker.
  The affected part of the text is enclosed in the label.
  Transcription: Today I went to [OVERLAP/] eat [/OVERLAP]

• [OFFENSIVE/][/OFFENSIVE]
  Description: Text affected by sensitive content, including
  uncomfortable content related to politics, opposition, religion and
  race, pornography, etc., is marked with this label.
  Transcription: [OFFENSIVE/] You're blind. [/OFFENSIVE] I just made a
  big move

Invalid data (Role: N)

• [IVS]: recorder's invalid speech segment
  Description: Use this label to mark noise segments longer than 0.5
  seconds. For example: overlapping voices of very similar volume;
  frame loss; speech truncation; speech echoes; abnormal speech tone,
  such as singing or speaking in a pinched voice; non-target language;
  words in the speech segment that are inaudible or cannot be
  transcribed because of noise.
  Transcription: [IVS]

• [OIVS]: non-recorder's invalid speech segment
  Description: Use this label to mark noise segments longer than 0.5
  seconds. For example: television voice; program broadcast; narration;
  advertisement; music with a human voice; etc.
  Transcription: [OIVS]
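
The requirement that labels be paired, consistently capitalized, and correctly bracketed can be verified with a short script. What follows is a minimal sketch, not an official validator: the label names come from the table above, while `check_labels` and the constant names are hypothetical.

```python
import re

STANDALONE = {"N", "HM", "IVS", "OIVS"}   # labels written once, e.g. [N]
PAIRED = {"OVERLAP", "OFFENSIVE"}         # labels written as [X/] ... [/X]
TAG = re.compile(r"\[([^\[\]]*)\]")       # any text enclosed in one bracket pair


def check_labels(transcript: str) -> list:
    """Return a list of label problems found in one transcript line."""
    problems = []
    open_stack = []
    for match in TAG.finditer(transcript):
        raw = match.group(1)
        if raw in STANDALONE:
            continue
        if raw.endswith("/") and raw[:-1] in PAIRED:      # opener, e.g. [OVERLAP/]
            open_stack.append(raw[:-1])
        elif raw.startswith("/") and raw[1:] in PAIRED:   # closer, e.g. [/OVERLAP]
            if open_stack and open_stack[-1] == raw[1:]:
                open_stack.pop()
            else:
                problems.append(f"closing [{raw}] has no matching opener")
        else:
            problems.append(f"unknown or malformed label [{raw}]")
    problems.extend(f"[{name}/] is never closed" for name in open_stack)
    # Any bracket left after removing complete tags has no partner.
    if re.search(r"[\[\]]", TAG.sub("", transcript)):
        problems.append("unpaired bracket")
    return problems
```

Because the comparison is case-sensitive, a lowercase label such as `[n]` is reported as malformed, which covers the inconsistent-capitalization case.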
