
Guidelines for Indonesian Subtitle ASR_0921

A. Basic introduction to Indonesian


The Indonesian language (Bahasa Indonesia) is the official language of the Republic of Indonesia and is
widely spoken throughout the Indonesian archipelago. In linguistic classification, Indonesian,
Malay, Javanese, etc. together constitute the West Indonesian branch of the Malayo-Polynesian
language family.
Indonesian is written in the Latin alphabet with spaces as word boundaries. It has a defined syllable
structure, and the pronunciation of each letter in a syllable maps one-to-one to the written
form.
Indonesian consists of 5 vowels, 3 diphthongs and 25 consonants (the single vowel e and é are not
distinguished at this time), and it is a non-tonal language. It is an agglutinative language, and new
words are formed in three ways: attaching affixes (affixes are divided into prefixes, infixes, and
suffixes) to the root, repeating several words or parts of words to form new words, and loanwords.
Grammatical gender is not used in Indonesian, only natural gender is used. There is no grammatical
plural. Verb conjugations do not involve person, number, or tense. The representation of time uses
adverbs of time or other tense indicators (such as sudah - "has" and belum - "has not yet").

B. Overall Rule
1. Just transcribe the text corresponding to the Indonesian voice, and transcribe it
according to the actual pronunciation (applicable to slang, numbers, symbols, URLs, e-
mail addresses, etc.).
2. Non-Indonesian speech, speech that cannot be confirmed to be Indonesian, and inaudible or
incomprehensible Indonesian fragments are NEITHER segmented NOR
transcribed.
3. If the entire fragment is in a local dialect such as Javanese or Sundanese, there is NO need to
segment or transcribe it; but if a single sentence contains a small number of English words, dialect
words or internet slang and remains semantically comprehensible, then it
needs to be transcribed.
4. There is NO need to correct grammar mistakes, just try to transcribe the actual
pronunciation.

C. Transcribe Rules
1. Punctuation Marks
a. Overall Rules

Add punctuation marks according to the meaning of sentence. If there is a pause after one clause,
there must be a punctuation mark.
[Note: Half-width format is required when transcribing, so please use the symbols of the
Indonesian input method.]
b. Category

Transcribe commas, periods, question marks, exclamation marks and hyphens; other marks DO NOT
need to be transcribed.

c. Hyphen (-)

■ When a word is doubled (two words are combined) to form a compound word, it expresses a new
meaning. A hyphen "-" is required.

For example: tiba (to arrive), tiba-tiba (suddenly).

■ Particle "lah". When the word before "lah" is a name, country, city or
English word, a hyphen "-" must be added when writing.

For example: Pak Ahmad-lah, Bebek-lah, Amerika-lah.


If the preceding word is not one of these types, transcribe it according to KBBI
rules.
For example: buanglah, gapailah, raihlah, sapulah.

d. Space

The punctuation mark immediately follows the preceding word and is separated from the following
word by a space.
For example: "Kamu menang, aku kalah."
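
The spacing rule above can be checked mechanically. The following is a minimal Python sketch and not part of the original guideline: the function name normalize_punctuation_spacing is hypothetical, and it only handles the comma, period, question mark and exclamation mark listed in rule b.

```python
import re


def normalize_punctuation_spacing(text: str) -> str:
    """Attach , . ? ! to the preceding word and follow each with one space."""
    # Remove whitespace that appears before a punctuation mark.
    text = re.sub(r"\s+([,.?!])", r"\1", text)
    # Put exactly one space after a mark when another word follows it.
    text = re.sub(r"([,.?!])\s*(?=[^\s,.?!])", r"\1 ", text)
    return text.strip()
```

Hyphens are deliberately left untouched, because a hyphen joins two words (tiba-tiba) instead of separating them.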

2. Upper and Lower Case ⭐⭐⭐


a. There is NO need to capitalize the first letter of a sentence.
b. There is NO need to capitalize the first letter of proper nouns (person names, place names,
organization names); the same applies to festivals, subjects, historical events, book titles, movie
names, TV program names, and song titles.
c. For proper nouns consisting of multiple words, there is NO need to capitalize the first letter
of each word as well.
d. ⭐If a special abbreviation or acronym is normally capitalized in Indonesian, such as AC,
KTP and DPRD, then the word needs to be capitalized while transcribing.
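
As an illustration of rules a to d, the sketch below lowercases every word except a small set of acronyms. It is only a sketch under assumptions: the function name apply_case_rules is hypothetical, and the ACRONYMS set contains just the three examples named in rule d, so a real project would need its own agreed list.

```python
# Illustrative list only: the three acronyms named in rule d.
ACRONYMS = {"AC", "KTP", "DPRD"}


def apply_case_rules(text: str, acronyms=ACRONYMS) -> str:
    """Lowercase all words, but keep listed acronyms in upper case."""
    result = []
    for token in text.split():
        core = token.strip(",.?!")  # the word without attached punctuation
        if core and core.upper() in acronyms:
            result.append(token.replace(core, core.upper(), 1))
        else:
            result.append(token.lower())
    return " ".join(result)
```

Note that this simple lookup would also capitalize an ordinary word that happens to be spelled like a listed acronym, so the output still needs a human check.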

3. Numbers
Indonesian Number    Arabic Number
kosong / nol         0
satu                 1
dua                  2
tiga                 3
empat                4
lima                 5
enam                 6
tujuh                7
delapan              8
sembilan             9
Generally speaking, there are two main ways to read numbers.
The first is to read the numerical value, e.g. 123 is read as "one hundred and twenty-three".
The second is to read digit by digit, e.g. 123 is read as "one two three".

a. The numbers are directly transcribed into the corresponding Indonesian pronunciation.
Space needs to be added between Indonesian numerals.
For example, 123 read as a numerical value should be transcribed as seratus dua puluh tiga; 1234
read digit by digit should be transcribed as satu dua tiga empat.

b. If the pronunciation of a number in Indonesian is swallowed (clipped), it does not need to be
restored to the original form.
For example, nam does not need to be transcribed as enam; lapan does not need to be transcribed as
delapan.

c. It can be transcribed as "no." or "nomer" when it comes to the ordinal numbers.

d. Use "," for the decimal point (comma; if it is read as “koma”, transcribe it as koma directly).
Use "." for the thousands separator (period; if it is read as “titik”, transcribe it as titik directly).

e. Percentages, fractions, decimals (such as 50%, 1/3, 5.1) are transcribed into corresponding
words instead of Arabic numerals and symbols.
For example, lima puluh persen, satu per tiga, lima koma satu...

f. When it comes to the ID card number, phone number or zip code, transcribe the way the
speaker reads it.
Add a space between the spelled-out numbers and the individual letters, and transcribe the numbers as
Indonesian number words according to the way the speaker reads them.
For example: sky123@gmail.com is transcribed as s k y satu dua tiga et gmail dot com.
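
The number rules above can be illustrated with a small Python sketch. It is not part of the original guideline, and all function names (spell_digits, spell_value, spell_expression, spell_email) are hypothetical. It assumes standard Indonesian numerals for values below one million, always uses nol for 0 (a speaker may say kosong), and reads decimal digits one by one after koma. It only shows the written target form; swallowed forms such as nam or lapan, and whatever the speaker actually says, always take priority.

```python
import re

DIGITS = ["nol", "satu", "dua", "tiga", "empat",
          "lima", "enam", "tujuh", "delapan", "sembilan"]


def spell_digits(digits: str) -> str:
    """Read a digit string one digit at a time (123 -> satu dua tiga)."""
    return " ".join(DIGITS[int(d)] for d in digits)


def spell_value(n: int) -> str:
    """Read an integer from 0 to 999999 as a numerical value."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999999")
    if n < 10:
        return DIGITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return DIGITS[n - 10] + " belas"
    if n < 100:
        tens, rest = divmod(n, 10)
        return DIGITS[tens] + " puluh" + (" " + DIGITS[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = "seratus" if hundreds == 1 else DIGITS[hundreds] + " ratus"
        return head + (" " + spell_value(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    head = "seribu" if thousands == 1 else spell_value(thousands) + " ribu"
    return head + (" " + spell_value(rest) if rest else "")


def spell_expression(token: str) -> str:
    """Spell a percentage (50%), fraction (1/3) or decimal (5.1 or 5,1)."""
    if token.endswith("%"):
        return spell_value(int(token[:-1])) + " persen"
    if "/" in token:
        numerator, denominator = token.split("/")
        return spell_value(int(numerator)) + " per " + spell_value(int(denominator))
    whole, fraction = re.split(r"[.,]", token)
    return spell_value(int(whole)) + " koma " + spell_digits(fraction)


def spell_email(address: str) -> str:
    """Spell the local part character by character, '@' as et, '.' as dot."""
    local, domain = address.split("@")
    spoken = [DIGITS[int(ch)] if ch in "0123456789" else ch for ch in local]
    return " ".join(spoken + ["et", " dot ".join(domain.split("."))])
```

For instance, spell_value(123) gives seratus dua puluh tiga, while spell_digits("123") gives satu dua tiga, which mirrors the two reading styles described in this section.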

4. Units
For international units such as kilometers, kilograms, square kilometers, etc., transcribe them
directly into the corresponding words according to the pronunciation.
✔ kilo, kilometer, meter persegi
❌ Do not use abbreviations: km, kg, m²
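
A reviewer could flag unit abbreviations automatically. The sketch below is a hedged illustration only: the function name find_unit_abbreviations is hypothetical, and the mapping covers just the three abbreviations listed above, with suggested spellings that the speaker's actual pronunciation may override.

```python
import re

# Illustrative mapping: abbreviation -> a spelled-out form to consider.
UNIT_ABBREVIATIONS = {
    "km": "kilometer",
    "kg": "kilo",
    "m²": "meter persegi",
}


def find_unit_abbreviations(text: str) -> list:
    """Return the tokens in a transcript that are listed unit abbreviations."""
    tokens = re.findall(r"[^\s,.?!]+", text)
    return [token for token in tokens if token.lower() in UNIT_ABBREVIATIONS]
```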

5. Gender Judgment
Categories: Male, Female, Children, Unknown.
When the actual speaker's gender does not match the gender the voice sounds like, judge by the voice
heard.

D. FAQ

1. Stuttering Word
It needs to be transcribed according to the actual pronunciation.
For example: se se sebenarnya.

2. Internet Slang & Terms

Swallowed or added sounds in common spoken language, such as internet slang and terms,
can be transcribed directly according to the pronunciation.
For example: There is NO need to transcribe nam/itam/ilang/mamae/mantul/lapan as
enam/hitam/hilang/mama/mantap betul/delapan.
If the speaker has an accent or uses a special pronunciation of internet slang, it should be
transcribed with the correct spelling to preserve the semantics, EXCEPT for dialect forms that are
understood by local people, which can be transcribed according to the pronunciation.
For example: mantul, itam, ilang.

3. Acronyms
a. If the speaker pronounces the full form, it CANNOT be transcribed in
its short form; for example, dan lain-lain CANNOT be abbreviated as dll.
b. If an acronym is read as an acronym (letter by letter), spell it directly as the corresponding
abbreviation, such as KTP, HP (mobile phone), AC (air conditioner), and DO NOT transcribe them as katepe, hape,
atse.
c. DO NOT abbreviate the week, month and address.
d. The Pronunciation of Indonesian Letters:
A[a:] B[bé] C[tsé] D[dé] E[é] F[éf] G[gé] H[ha] I[i] J[jé] K[ka] L[él] M[ém] N[én] O[o] P[pé]
Q[ki] R[ér] S[és] T[té] U[u] V[fé] W[wé] X[éks] Y[yé] Z[zé]

4. Idiomatic Expression & Phonetic Transcription


When two spellings represent the same pronunciation, either spelling is correct.
For example: santai→santay, from i to y.
If the pronunciation is different, transcribe the word as it is actually pronounced.
For example: sama-sama→sami-sami, from a to i.

5. Modal
The modal particles can be transcribed according to the pronunciation, such as ih, nih, loh, tuh, eh,
ya, but if the whole segment is full of modal particles, do not transcribe it.
[Note: The modal particles sih, nih, loh, tuh, eh CANNOT be transcribed as si, ni, lo, tu, e;
doing so will be judged as an error by the reviewer.]

6. Loanword & Brand Name


Transcribe according to the actual pronunciation; there is NO need to restore the
original English spelling, such as xiaomi → Xiomi. DO NOT use wrong words with similar
pronunciation.

7. Dialect
1. All dialects or local languages DO NOT need to be transcribed.
2. If there is only a small number of Indonesian words in the middle of the dialect, there is no need to
transcribe it.

For example: There are a total of 10 words, of which 3 or fewer are Indonesian
words. In that case, there is no need to segment or transcribe; select "Cannot label".

3. If there is a large number of Indonesian words in the middle of the dialect, it can be
transcribed.

For example: 5 or more of the 10 words in a segment are Indonesian words that we
can find in KBBI. In that case, it should be segmented and transcribed.

4. If the speaker says an Indonesian word with a local “dialect” or “pronunciation”,
but we can still clearly hear that it is an Indonesian word and can find it in KBBI, then it
should be transcribed. If the pronunciation is not clear, we do not need to transcribe
it.
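
The two thresholds in the dialect examples can be written as a small decision function. This is a sketch under a loud assumption: the guideline only gives counts for a 10-word segment (3 or fewer, 5 or more), and scaling them to a ratio for other lengths is our own extrapolation. The function name dialect_decision is hypothetical, and the case in between (4 of 10) is not specified by the guideline, so it is returned as "unspecified".

```python
def dialect_decision(indonesian_words: int, total_words: int) -> str:
    """Classify a mixed dialect/Indonesian segment by its share of Indonesian words."""
    if total_words <= 0 or not 0 <= indonesian_words <= total_words:
        raise ValueError("invalid word counts")
    # Integer comparison avoids floating-point issues at the boundaries.
    if 10 * indonesian_words <= 3 * total_words:   # 3 of 10 or fewer
        return "cannot label"
    if 10 * indonesian_words >= 5 * total_words:   # 5 of 10 or more
        return "transcribe"
    return "unspecified"  # the guideline gives no rule for this range
```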

8. Prefix & Suffix


Indonesian grammar rules should be followed while transcribing.
[Note for di-]
When di- is used as the passive prefix of a verb, do not add a space after di, such as digunakan;
when di is used with a location, meaning “at”, add a space after di, such as di mana, di sekolah.
[Note for Kau- and Ku-]
Kau- can be written joined to the following word if that word is a verb that can take the passive form. Ex:
kaubaca (dibaca), kautulis (ditulis), kaubelikan (dibelikan), etc. To check whether a
word can be written joined, use this website: https://id.wiktionary.org/wiki/beli (type
the word you are unsure of into the search box on this website, for example
“kaumakan”; if there is an explanation or meaning for the word, then it can be
written joined).
Ku- is written joined to the following word most of the time, but the following word needs
to be a verb that can take the passive form. We also need to check whether the word functions as a
verb in the sentence, because some Indonesian words can act as both a verb and an adjective,
so pay attention to how the word is used. We can check it at
https://id.wiktionary.org/wiki/Wiktionary:Halaman_Utama
https://beritagar.id/artikel/tabik/kau-harus-kusambung

9. Incomplete word
If there is an incomplete word at the beginning or the end of a sentence, discard this word, neither
segment nor transcribe.

10. Spelling
If the speaker spells a word letter by letter, separate the letters with spaces.
For example: If the speaker reads it as "a p p l y", transcribe it as "a p p l y".

11. Overlapping & Interjection
When the audio overlaps, NO segmentation and transcription should be performed.
⭐In addition, it is necessary to clarify the difference between "overlapping" and "interjection".
If two people are talking at the same time, then this situation is determined as "overlapping", and
NO segmentation and transcription will be performed.
But if one person interjects and, at that moment, only one voice can be clearly heard because
the other party is not speaking at the same time, this is regarded as "interjection" and can be
segmented and transcribed normally.

12. Unison
When multiple people say the same sentence in unison, they can be transcribed normally.
In this case, if the gender of the speakers is the same, select the corresponding gender; if the gender
is different, select "Unknown".

13. Laughter
Real laughter DOES NOT need to be transcribed, but if the laughter is made deliberately, and
1 to 3 laughs can be clearly counted by ear, that is, ha, haha or
hahaha, it can be transcribed normally.
Longer laughter, or laughter whose count cannot be determined by ear, should be cut out directly;
DO NOT transcribe it.
[Example A: 1 to 3 ha should be reserved.]
Video ini lucu bangat, hahaha, saya mau nonton lagi.
Segmented and transcribed as: video ini lucu bangat, hahaha, saya mau nonton lagi.
[Example B: More than 3 ha, or even longer, neither segment nor transcribe.]
Video ini lucu bangat, hahahahahahaha, saya mau nonton lagi.
Segment 1 should be transcribed as: video ini lucu bangat.
Segment 2 should be transcribed as: saya mau nonton lagi.
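
Examples A and B can be reproduced on text with a short sketch. It is an illustration only: the function name split_on_long_laughter is hypothetical, it works on a draft transcript instead of audio, it cannot judge whether laughter is "real" or deliberate, and closing each resulting segment with a period is an assumption taken from Example B.

```python
import re

# A run of four or more "ha" (more than the 1 to 3 allowed), plus a trailing comma.
LONG_LAUGHTER = re.compile(r"\b(?:ha){4,}\b,?", re.IGNORECASE)


def split_on_long_laughter(text: str) -> list:
    """Keep ha/haha/hahaha in place; cut longer laughter and split the segment there."""
    segments = []
    for part in LONG_LAUGHTER.split(text):
        cleaned = part.strip(" ,")
        if not cleaned:
            continue
        if cleaned[-1] not in ".?!":
            cleaned += "."  # assumption: each new segment ends like a sentence
        segments.append(cleaned)
    return segments
```

Each returned segment would still have to meet the minimum length of two words before it is kept.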

14. Inaudible ⭐⭐⭐


If there are more than two (>2) inaudible words in a segment, discard the segment and do not transcribe
it. If a word CANNOT be heard clearly, and the word seriously affects the semantics of the
sentence, the sentence is also discarded and not transcribed.
Note: The method for judging whether a word seriously affects semantics is as follows:
If the word is cut out and the remaining two parts of the sentence each still make sense, then the word
should be left out, and the remaining two parts should be transcribed separately.
However, each of these two parts still needs to meet the minimum segment length, that is, "if
the segment contains fewer than two words (<2), then the fragment is invalid and cannot be
segmented"; a part that does not meet this criterion is NEITHER segmented NOR transcribed.
If the word is cut out and the rest of the sentence does not make sense, the word
seriously affects the semantics, so the whole sentence should be omitted and there is NO need to
transcribe it.
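
The decision procedure for inaudible words can be sketched as follows. Everything here is an assumption made for illustration: the marker "<?>" for an inaudible word, the function name handle_inaudible, and the reading that more than two inaudible words invalidate the whole segment. Whether the remaining parts "make sense" is a human judgment, so it is passed in as a flag.

```python
INAUDIBLE = "<?>"  # hypothetical marker for a word that cannot be heard


def handle_inaudible(tokens: list, parts_make_sense: bool) -> list:
    """Return the segments to transcribe; an empty list means discard everything."""
    count = tokens.count(INAUDIBLE)
    if count == 0:
        return [" ".join(tokens)]
    if count > 2 or not parts_make_sense:
        return []
    # Cut out each inaudible word and collect the parts around it.
    parts, current = [], []
    for token in tokens:
        if token == INAUDIBLE:
            parts.append(current)
            current = []
        else:
            current.append(token)
    parts.append(current)
    # A part is only valid if it has at least two words.
    return [" ".join(part) for part in parts if len(part) >= 2]
```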
15. Talk Show & Role Play
Divide by role: role A, role B..., and narration. What each character says should meet the minimum
length limit (number of words ≥ 2). If not, please refer to "Segmentation 1".

16. Song
When there is only audio but no video reference, if the song and another human voice appear at the
same time, refer to "Segmentation 4" for labelling;
If only the song appears, it can be segmented and transcribed normally because it is impossible to
judge whether it is background music or lip-sync.

General Rules for ASR Transcription


Valid data
Segmentation
(Short Video Stream Segmentation Rules can be found at the end of the text)

1. The First Principle - Ensure the completeness of the sentence and its semantics, try not
to divide a complete sentence into pieces.

If two or more speakers (≥2) are talking, segment each speaker separately.
In some special cases, if one speaker says fewer than 2 words, that turn can be segmented together with
another speaker's. In that case, if the gender of the speakers is the same, select the corresponding gender; if
the gender is different, select "Unknown".
[Example 1]
A: mau ke mal nggak?
B: nggak mau.
A: siap
Segment 1 should be transcribed as: mau ke mal nggak?
Segment 2 should be transcribed as: nggak mau. siap.
[Example 2]
A: mau ke mal nggak?
B: nggak.
A: jadi mau ke mana?
B: restoran.
Segment 1 should be transcribed as: mau ke mal nggak? nggak.
Segment 2 should be transcribed as: jadi mau ke mana? restoran.

2. The segment must be a valid one with at least two words. The longest duration is
not limited, but try to follow the first principle - try not to split a complete sentence
into separate segments.
3. Within each segment, no sound other than human voice (including silence) may last longer than
2 seconds.
4. There can only be one main speaker's voice at a time inside each segment.
If there is background sound (other people talking or singing) ---
If it doesn't affect the main voice, ignore it and segment the main voice as normal;
if it’s too loud, the data is deemed invalid; DO NOT segment or transcribe it.

■ Judging whether it is too loud or not:

The background sound/noise obviously interfered with the main speaker. That is, the speaker
cannot be heard, so the background sound is regarded as loud and is not transcribed.
If the speaker's words can be heard clearly regardless of the background sound/noise, then it
should be transcribed normally.
[Note: If the background sound or noise is human voice and the specific content can be heard
clearly, it will be regarded as "overlapping" and will be NEITHER segmented NOR transcribed.]
Situations where normal segmentation and transcribing are required
■ There is very clear music and singing in the background, but the speaker can be heard clearly
anyway, then transcribe normally;
■ There are loud noises in the background (such as on the side of the road or in a bar), but the
speaker can be heard clearly, then transcribe normally;
■ The background sound has special effects (such as laughter) to heighten the atmosphere, but the
speaker can be heard clearly, then transcribe normally;
■ When the main speaker speaks, if another party speaks in a low voice or can't be heard at
all, the content of the main speaker's words should be transcribed normally and we should
determine the gender of the voice according to the main speaker.
Situations where the content should be discarded
■ The background sound overpowers the speaker's voice and we can't make sure what is said,
we need to discard it;
■ Two people are talking at the same time, and both voices can be heard clearly. For example,
when they are arguing, the clip needs to be discarded.

5. ⭐⭐⭐Leave a space between 0.2 and 0.5 seconds before and after each segment (the
space can be any sound, including silence, as long as it's not obvious vocals).

[Note for Reviewer]


If the labeler leaves only 0.19s or 0.18s of white space, there is no need to mark it
as wrong. If another speaker is talking right before the segment starts, we do not need to leave the full 0.2 to 0.5s
of space, but if there is still a gap between them (e.g. 0.1s), we still need to leave some white space.
PS: We MUST leave a gap between two segments.
When many people speak in turn quickly (without overlapping), turns of no
more than one word each can be segmented together; but turns of more than one word must be
segmented separately, and in this situation a very short gap (e.g. 0.005s, 0.01s or 0.015s) is a
MUST.

6. Segment and transcribe voice changer, TV broadcast, navigation and other sounds
normally.
7. Segment and transcribe the narration normally.
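
Two of the segmentation rules above lend themselves to a short sketch: merging very short turns (rule 1, Examples 1 and 2) and checking the white space around a segment (rule 5 and the reviewer note). The code is illustrative only. The function names are hypothetical, attaching a short turn to the preceding segment is inferred from the two examples, and the 20 ms reviewer tolerance is an assumption chosen to cover the 0.18s and 0.19s cases mentioned in the note.

```python
MIN_PAD_MS, MAX_PAD_MS = 200, 500   # rule 5: 0.2 to 0.5 seconds
REVIEW_TOLERANCE_MS = 20            # assumption: accepts 0.18s and 0.19s


def merge_short_turns(turns: list) -> list:
    """Merge a turn of fewer than 2 words into the preceding segment.

    `turns` is a list of (gender, text) pairs in speaking order.
    """
    segments = []
    for gender, text in turns:
        if len(text.split()) < 2 and segments:
            previous_gender, previous_text = segments[-1]
            # Same gender keeps that gender; different genders become "Unknown".
            merged_gender = previous_gender if previous_gender == gender else "Unknown"
            segments[-1] = (merged_gender, previous_text + " " + text)
        else:
            segments.append((gender, text))
    return segments


def padding_is_acceptable(pad_ms: int, tolerance_ms: int = REVIEW_TOLERANCE_MS) -> bool:
    """Check the white space before or after a segment, in milliseconds."""
    return MIN_PAD_MS - tolerance_ms <= pad_ms <= MAX_PAD_MS
```

Milliseconds are used instead of fractional seconds so that boundary values such as 0.18s compare exactly. The padding check does not cover the quick turn-taking case, where only a very short gap is possible.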

Invalid data
1. If the entire audio is in an incomprehensible dialect, is silent, or only repeats the same
letters or syllables, such as "ya ya ya" and "hehehe", the fragment is invalid; select "Cannot label".
2. If no one speaks in the entire audio or the speaker only repeats a single word, the
whole fragment is invalid; select "Cannot label".
3. If the singing or speech is severely slurred (including speech by infants who are just
learning to speak and cannot speak clearly), the speaker's accent is too heavy, the
pronunciation is blurred, or severe frame drops or stuttering in the video make the speech
incomprehensible, the whole fragment is invalid; select "Cannot label".
4. If the vocal part of the fragment is fewer than two words, the whole fragment is invalid;
select "Cannot label".
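
The text-based parts of these criteria (fewer than two words, or one word repeated) can be checked with a small sketch. It is illustrative only: the function name is_cannot_label is hypothetical, and the other criteria (dialect, slurred speech, heavy accent, video problems) need human judgment and are not covered.

```python
import re


def is_cannot_label(transcript: str) -> bool:
    """Return True when the draft transcript alone already makes the fragment invalid."""
    words = re.findall(r"[^\s,.?!]+", transcript.lower())
    if len(words) < 2:          # silence, or fewer than two words
        return True
    if len(set(words)) == 1:    # the same word repeated, e.g. "ya ya ya"
        return True
    return False
```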
