Guidelines For Indonesian Subtitle ASR
B. Overall Rule
1. Transcribe only the text corresponding to Indonesian speech, and transcribe it
according to the actual pronunciation (applies to slang, numbers, symbols, URLs,
e-mail addresses, etc.).
2. Non-Indonesian speech, speech that cannot be confirmed as Indonesian, and inaudible or
incomprehensible Indonesian fragments are NEITHER segmented NOR
transcribed.
3. If the entire fragment is in a local dialect such as Javanese or Sundanese, there is NO need to
segment or transcribe it; but if a small number of English words, dialect words or
internet slang appear within a single sentence that is still semantically comprehensible, then it
needs to be transcribed.
4. There is NO need to correct grammar mistakes, just try to transcribe the actual
pronunciation.
C. Transcribe Rules
1. Punctuation Marks
a. Overall Rules
Add punctuation marks according to the meaning of the sentence. If there is a pause after a clause,
there must be a punctuation mark.
[Note: Half-width format is required when transcribing, so please use the symbols of the
Indonesian input method.]
b. Category
Transcribe commas, periods, question marks, exclamation marks and hyphens; other marks DO NOT
need to be transcribed.
c. Hyphen (-)
When two words are joined to form a compound word that expresses a new
meaning, a hyphen "-" is required.
Particle "lah": when the word before "lah" is a personal name, country, city or
English word, a hyphen "-" must be added when writing.
d. Space
The punctuation mark immediately follows the preceding word and is separated from the following
word by a space.
For example: "Kamu menang, aku kalah."
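The spacing and half-width rules above can be checked mechanically. The following is a minimal sketch, not part of the guideline itself; the function name `spacing_errors` and the list of full-width characters are assumptions chosen for illustration.

```python
import re

# Full-width variants that the half-width note rules out (illustrative list).
FULL_WIDTH_MARKS = "，。？！"

def spacing_errors(text: str) -> list[str]:
    """Return one message per spacing or width violation found in `text`."""
    errors = []
    # A mark must follow the preceding word directly: no whitespace before it.
    for m in re.finditer(r"\s+[,.?!]", text):
        errors.append(f"space before mark at index {m.start()}")
    # A mark must be separated from the following word by a space.
    for m in re.finditer(r"[,.?!](?=[^\s,.?!])", text):
        errors.append(f"missing space after mark at index {m.start()}")
    # Only half-width punctuation is allowed.
    for ch in FULL_WIDTH_MARKS:
        if ch in text:
            errors.append(f"full-width mark {ch!r} found")
    return errors
```

An empty list means the line passes these three checks; the sketch does not judge whether the punctuation matches the meaning of the sentence.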
3. Numbers
Indonesian Number    Arabic Number
kosong / nol 0
satu 1
dua 2
tiga 3
empat 4
lima 5
enam 6
tujuh 7
delapan 8
sembilan 9
Generally speaking, there are two main ways to read numbers.
The first is as a numerical value: 123 is read as one hundred and twenty-three.
The second is digit by digit: 123 is read as one two three.
a. Numbers are transcribed directly as the corresponding Indonesian words, following the pronunciation.
A space must be added between Indonesian numerals.
For example, 123 read as a numerical value should be transcribed as seratus dua puluh tiga; 1234
read digit by digit should be transcribed as satu dua tiga empat.
d. Use "," for the decimal point (if it is read as “koma”, transcribe it as koma directly).
Use "." for the thousands separator (if it is read as “titik”, transcribe it as titik directly).
e. Percentages, fractions, decimals (such as 50%, 1/3, 5.1) are transcribed into corresponding
words instead of Arabic numerals and symbols.
For example, lima puluh persen, satu per tiga, lima koma satu...
f. When it comes to the ID card number, phone number or zip code, transcribe the way the
speaker reads it.
Add a space between spelled-out numbers and letters, and transcribe the digits as Indonesian
number words according to the way the speaker reads them.
For example: sky123@gmail.com is transcribed as s k y satu dua tiga et gmail dot com.
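The two reading styles for numbers can be sketched in code. This is an illustrative helper only: the names `digit_by_digit` and `as_value` are assumptions, the range is limited to 0 through 999,999, and "nol" is used for zero even though a speaker may say "kosong" (the transcript must follow what was actually said).

```python
# Words for the digits 0-9, taken from the table above ("nol" chosen for 0).
UNITS = ["nol", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "delapan", "sembilan"]

def digit_by_digit(digits: str) -> str:
    """Read a digit string one digit at a time, separated by spaces."""
    return " ".join(UNITS[int(d)] for d in digits)

def as_value(n: int) -> str:
    """Read an integer from 0 to 999,999 as a numerical value."""
    if n < 0 or n >= 1_000_000:
        raise ValueError("only 0..999999 is covered by this sketch")
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return UNITS[n - 10] + " belas"
    if n < 100:
        tens, rest = divmod(n, 10)
        return UNITS[tens] + " puluh" + (" " + UNITS[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = "seratus" if hundreds == 1 else UNITS[hundreds] + " ratus"
        return head + (" " + as_value(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    head = "seribu" if thousands == 1 else as_value(thousands) + " ribu"
    return head + (" " + as_value(rest) if rest else "")
```

Percentages and decimals can be composed from the same pieces, for example `as_value(50) + " persen"` or `as_value(5) + " koma " + digit_by_digit("1")`.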
4. Units
For international units such as kilometers, kilograms, square kilometers, etc., you can directly
transcribe them into corresponding words according to the pronunciation.
✔ kilo, kilometer, meter persegi
❌ Do not use abbreviations: km, kg, m²
5. Gender Judgment
Categories: Male, Female, Children, Unknown.
When the gender of the actual speaker does not match the gender of the voice, judge it by the voice
heard.
D. FAQ
1. Stuttering Word
It needs to be transcribed according to the actual pronunciation.
For example: se se sebenarnya.
3. Acronyms
a. If the speaker pronounces the full form, it CANNOT be transcribed in
its short form; for example, dan lain-lain CANNOT be abbreviated as dll.
b. If an acronym is read letter by letter, spell it as the corresponding abbreviation, such
as KTP, HP (mobile phone), AC (air conditioner), and DO NOT transcribe them as katepe, hape,
atse.
c. DO NOT abbreviate days of the week, months or addresses.
d. The Pronunciation of Indonesian Letters:
A[a:] B[bé] C[tsé] D[dé] E[é] F[éf] G[gé] H[ha] I[i] J[jé] K[ka] L[él] M[ém] N[én] O[o] P[pé]
Q[ki] R[ér] S[és] T[té] U[u] V[fé] W[wé] X[éks] Y[yé] Z[zé]
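Rule b and the letter table can be combined into a small check that catches phonetic spellings of acronyms. This is a sketch under assumptions: the letter names are the table's values with accents and length marks stripped, and `phonetic_spelling` and `flag_phonetic_acronyms` are hypothetical helper names.

```python
# Letter names from the table above, written without accents.
LETTER_NAMES = {
    "A": "a", "B": "be", "C": "tse", "D": "de", "E": "e", "F": "ef",
    "G": "ge", "H": "ha", "I": "i", "J": "je", "K": "ka", "L": "el",
    "M": "em", "N": "en", "O": "o", "P": "pe", "Q": "ki", "R": "er",
    "S": "es", "T": "te", "U": "u", "V": "fe", "W": "we", "X": "eks",
    "Y": "ye", "Z": "ze",
}

def phonetic_spelling(acronym: str) -> str:
    """Build the spelled-out form that must NOT appear in a transcript."""
    return "".join(LETTER_NAMES[ch] for ch in acronym.upper())

def flag_phonetic_acronyms(text: str, acronyms: list[str]) -> list[tuple[str, str]]:
    """Return (found word, expected acronym) pairs for phonetic spellings."""
    lookup = {phonetic_spelling(a): a for a in acronyms}
    hits = []
    for raw in text.split():
        word = raw.strip(",.?!").lower()
        if word in lookup:
            hits.append((word, lookup[word]))
    return hits
```

The acronym list passed in is up to the project; the three acronyms named in rule b are a natural starting point.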
5. Modal
Modal particles are transcribed according to the pronunciation, such as ih, nih, loh, tuh, eh,
ya, but if the whole segment consists only of modal particles, do not transcribe it.
[Note: The modal particles sih, nih, loh, tuh, eh CANNOT be transcribed as si, ni, lo, tu, e;
the reviewer will judge these spellings as errors.]
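A reviewer-side check for the particle spellings could look like the sketch below. The names `flag_particles` and `only_particles` are hypothetical, and the result is only a list of candidates for manual review, since some of the short forms may be legitimate words in other contexts.

```python
import re

# Short spellings rejected by the note, mapped to the required spelling.
PARTICLE_FIXES = {"si": "sih", "ni": "nih", "lo": "loh", "tu": "tuh", "e": "eh"}

# Particles named in this section (both the main list and the note).
PARTICLES = {"ih", "sih", "nih", "loh", "tuh", "eh", "ya"}

def flag_particles(text: str) -> list[tuple[str, str]]:
    """Return (found, suggested) pairs for short particle spellings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(t, PARTICLE_FIXES[t]) for t in tokens if t in PARTICLE_FIXES]

def only_particles(text: str) -> bool:
    """True when every word in the segment is a modal particle."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return bool(tokens) and all(t in PARTICLES for t in tokens)
```

A segment for which `only_particles` returns True falls under the "do not transcribe" case above.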
For example: A segment has 10 words in total, of which 3 or fewer are Indonesian
words. In that case, there is no need to segment or transcribe; select "Cannot label".
3. If a large number of Indonesian words appear within the dialect speech, it can be
transcribed.
For example: 5 or more of the 10 words in a segment are Indonesian words that can be
found in KBBI. In that case, it should be segmented and transcribed.
4. If the speaker pronounces an Indonesian word with a local dialect accent,
but the word can still be heard clearly as Indonesian and can be found in KBBI, then it
should be transcribed. If the pronunciation is not clear, it does not need to be
transcribed.
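The two worked examples (3 or fewer Indonesian words out of 10, and 5 or more out of 10) can be turned into a rough decision helper. This sketch makes two assumptions that the guideline does not state: the thresholds are scaled proportionally to segments of other lengths, and the case in between (4 out of 10) is returned as "undetermined". The `kbbi` argument is a hypothetical set of dictionary words standing in for a real KBBI lookup.

```python
def dialect_decision(words: list[str], kbbi: set[str]) -> str:
    """Classify a mixed dialect/Indonesian segment by its share of KBBI words."""
    if not words:
        return "cannot label"
    known = sum(1 for w in words if w.lower() in kbbi)
    total = len(words)
    # 5 or more out of 10 -> transcribe (integer math avoids float rounding).
    if known * 10 >= 5 * total:
        return "transcribe"
    # 3 or fewer out of 10 -> cannot label.
    if known * 10 <= 3 * total:
        return "cannot label"
    # The guideline gives no example for this range.
    return "undetermined"
```

Segments that come back as "undetermined" need a human decision or a project-specific rule.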
9. Incomplete word
If there is an incomplete word at the beginning or the end of a sentence, discard this word, neither
segment nor transcribe.
10. Spelling
If the speaker spells a word letter by letter, separate the letters with spaces.
For example: If the speaker reads it as "a p p l y", transcribe it as "a p p l y".
11. Overlapping & Interjection
When the audio overlaps, NO segmentation and transcription should be performed.
⭐In addition, it is necessary to clarify the difference between "overlapping" and "interjection".
If two people are talking at the same time, then this situation is determined as "overlapping", and
NO segmentation and transcription will be performed.
But if one of them interjects and, while that person's voice is clearly heard,
the other party is not speaking at the same time, this is regarded as "interjection" and can be
segmented and transcribed normally.
12. Unison
When multiple people say the same sentence in unison, they can be transcribed normally.
In this case, if the gender of the speakers is the same, select the corresponding gender; if the gender
is different, select "Unknown".
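The gender rule for speech in unison is simple enough to express directly. A minimal sketch, assuming the per-speaker labels are already known and use the categories from the Gender Judgment section; `segment_gender` is a hypothetical name.

```python
def segment_gender(speaker_genders: list[str]) -> str:
    """Pick the segment label: the shared gender, otherwise "Unknown"."""
    unique = set(speaker_genders)
    if len(unique) == 1:
        return unique.pop()
    return "Unknown"
```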
13. Laughter
Real laughter DOES NOT need to be transcribed, but if the laughter is deliberate and
the number of laughs can be counted by ear as 1 to 3, that is, ha, haha or
hahaha, it can be transcribed normally.
Longer laughter, or laughter whose count cannot be determined by ear, can be cut out directly;
DO NOT transcribe it.
[Example A: 1 to 3 ha should be reserved.]
Video ini lucu bangat, hahaha, saya mau nonton lagi.
Segmented and transcribed as: video ini lucu bangat, hahaha, saya mau nonton lagi.
[Example B: More than 3 ha, or even longer, neither segment nor transcribe.]
Video ini lucu bangat, hahahahahahaha, saya mau nonton lagi.
Segment 1 should be transcribed as: video ini lucu bangat.
Segment 2 should be transcribed as: saya mau nonton lagi.
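Examples A and B can be reproduced on text with a regular expression. This is only a sketch of the text side of the rule: in practice the cut is made on the audio, and whether laughter is deliberate is a listening judgment. The name `split_on_long_laughter` is hypothetical.

```python
import re

# Four or more consecutive "ha" syllables, plus an optional trailing mark.
LONG_LAUGH = re.compile(r"\b(?:ha){4,}\b[,.!?]?", re.IGNORECASE)

def split_on_long_laughter(text: str) -> list[str]:
    """Drop laughter longer than 3 "ha" and split the text around it."""
    segments = []
    for part in LONG_LAUGH.split(text):
        cleaned = part.strip(" ,")
        if not cleaned:
            continue
        # Close a segment that lost its final mark when the laughter was cut.
        if cleaned[-1] not in ".?!":
            cleaned += "."
        segments.append(cleaned)
    return segments
```

Laughter of 1 to 3 "ha" does not match the pattern, so a sentence like Example A comes back as a single segment.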
16. Song
When there is only audio but no video reference, if the song and another human voice appear at the
same time, refer to "Segmentation 4" for labelling;
If only the song appears, it can be segmented and transcribed normally because it is impossible to
judge whether it is background music or lip-sync.
1. The First Principle - Ensure the completeness of the sentence and its semantics, try not
to divide a complete sentence into pieces.
If two or more speakers are talking, segment each speaker separately.
In some special cases, if one speaker says fewer than 2 words, that speech can be segmented together with
another speaker's. In that case, if the speakers have the same gender, select the corresponding gender; if
their genders differ, select "Unknown".
[Example 1]
A: mau ke mal nggak?
B: nggak mau.
A: siap
Segment 1 should be transcribed as: mau ke mal nggak?
Segment 2 should be transcribed as: nggak mau. siap.
[Example 2]
A: mau ke mal nggak?
B: nggak.
A: jadi mau ke mana?
B: restoran.
Segment 1 should be transcribed as: mau ke mal nggak? nggak.
Segment 2 should be transcribed as: jadi mau ke mana? restoran.
2. The segment must be a valid one with more than two words. The maximum duration is
not limited, but follow the first principle: try not to split a complete sentence
across segments.
3. Within each segment, no sound other than human voice may last longer than 2 seconds; this
includes silence.
4. There can only be one main speaker's voice at a time inside each segment.
If there is background sound (other people talking or singing):
If it doesn't affect the main voice, ignore it and segment the main voice as normal;
if it is too loud, the data is deemed invalid; DO NOT segment or transcribe it.
If the background sound/noise obviously interferes with the main speaker, that is, the speaker
cannot be heard, the background sound is regarded as loud and the audio is not transcribed.
If the speaker's words can be heard clearly regardless of the background sound/noise, then it
should be transcribed normally.
[Note: If the background sound or noise is human voice and the specific content can be heard
clearly, it will be regarded as "overlapping" and will be NEITHER segmented NOR transcribed.]
Situations where normal segmentation and transcribing are required
■ There is very clear music and singing in the background, but the speaker can be heard clearly
anyway, then transcribe normally;
■ There are loud noises in the background (such as on the side of the road or in a bar), but the
speaker can be heard clearly, then transcribe normally;
■ The background sound has special effects (such as laughter) to heighten the atmosphere, but the
speaker can be heard clearly, then transcribe normally;
■ When the main speaker speaks, if one of the other parties speaks in a low voice or can't be heard at
all, the content of the main speaker's words should be transcribed normally and we should
determine the gender of the voice according to the main speaker.
Situations where the content should be discarded
■ The background sound overpowers the speaker's voice and we can't make sure what is said,
we need to discard it;
■ Two people are talking at the same time, and both voices can be heard clearly. For example,
when they are arguing, the clip needs to be discarded.
5. ⭐⭐⭐Leave a margin of 0.2 to 0.5 seconds before and after each segment (the
margin can contain any sound, including silence, as long as it is not obvious vocals).
6. Segment and transcribe voice changer, TV broadcast, navigation and other sounds
normally.
7. Segment and transcribe the narration normally.
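The timing constraints in rules 3 and 5 lend themselves to an automated check. The sketch below assumes a hypothetical `Segment` record with times in seconds and a list of durations for the non-voice stretches inside the segment; none of these field names come from an actual labelling tool.

```python
from dataclasses import dataclass, field

MIN_PAD, MAX_PAD = 0.2, 0.5   # margin before and after the speech (rule 5)
MAX_NON_VOICE = 2.0           # longest allowed non-voice stretch (rule 3)

@dataclass
class Segment:
    start: float          # segment boundary start
    end: float            # segment boundary end
    speech_start: float   # first voiced moment inside the segment
    speech_end: float     # last voiced moment inside the segment
    non_voice: list[float] = field(default_factory=list)  # gap durations

def check_segment(seg: Segment) -> list[str]:
    """Return a list of timing problems; an empty list means the segment passes."""
    problems = []
    pads = (("leading", seg.speech_start - seg.start),
            ("trailing", seg.end - seg.speech_end))
    for name, pad in pads:
        if not (MIN_PAD <= pad <= MAX_PAD):
            problems.append(f"{name} margin {pad:.2f}s is outside 0.2-0.5s")
    for gap in seg.non_voice:
        if gap > MAX_NON_VOICE:
            problems.append(f"non-voice stretch of {gap:.2f}s exceeds 2s")
    return problems
```

Word-count and overlap rules are left out on purpose, since they depend on the transcript and on listening judgment rather than on timestamps.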
Invalid data
1. If the entire audio is in an incomprehensible dialect, is silent, or consists of the same
repeated sounds, such as "ya ya ya" and "hehehe", the fragment is invalid; select "Cannot label".
2. If no one speaks in the entire audio or the speaker is always repeating a single word, the
whole fragment is invalid; select "Cannot label".
3. If the singing or speech is severely slurred (including speech by infants who are just
learning to speak and cannot speak clearly), the speaker's accent is too heavy, the
pronunciation is blurred, the video drops frames severely, or stuttering makes the speech
incomprehensible, the whole fragment is invalid; select "Cannot label".
4. If the vocal part of the fragment is less than two words, the whole fragment is invalid;
select "Cannot label".
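The text-based parts of these rules (no speech, a single repeated word, fewer than two words) can be pre-screened before a human listens. A minimal sketch with a hypothetical function name; the audio-quality conditions in rule 3 cannot be judged from a transcript and are not covered.

```python
def is_invalid_transcript(transcript: str) -> bool:
    """True when the draft transcript alone already implies "Cannot label"."""
    # Split on whitespace and drop surrounding punctuation from each word.
    words = [w.strip(",.?!") for w in transcript.lower().split()]
    words = [w for w in words if w]
    # No speech, or a vocal part of fewer than two words.
    if len(words) < 2:
        return True
    # The speaker only repeats one word, e.g. "ya ya ya".
    if len(set(words)) == 1:
        return True
    return False
```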