
Indonesian_Lyric ASR_Guidelines​

Please also refer to the Q&A document: Indonesian_ASR_Q&A


Platform manual: Indonesian Transcription Rules
I. Overall Rules​
1. Transcribe only the text that corresponds to Indonesian speech, following the actual
pronunciation (this applies to slang, numbers, symbols, web addresses, email addresses, etc.).
2. Non-Indonesian speech, uncertain Indonesian speech, and inaudible or unintelligible
Indonesian fragments are not segmented and not transcribed.
3. If whole sentences are spoken in a regional dialect such as Javanese or Sundanese, do not
transcribe them. However, if a single sentence contains some English or a small number of
dialect/internet words and is still understandable, transcribe it.
4. Do not correct grammar; just transcribe the pronunciation faithfully.
5. You can refer to the video subtitles to assist the transcription. BUT when the subtitles are
not consistent with the audio, PLS FOCUS ON THE AUDIO CONTENT and transcribe what you hear
in the video.
6. You can refer to the lyrics on Google or in your favorite music app or website.
7. Speech in a segment should also be kept and transcribed if it is clear, even though this
is a lyric project. That is to say, if the audio contains only clear speech instead of lyrics, it
still needs transcription.
8. If speech and singing overlap and are hard to transcribe, cut the overlapped part out and
do not transcribe it. For example, if the audio contains speech 1 overlapped with song
part 1, followed by song part 2 on its own, keep only the clear song part 2.
9. Transcribe the content of the segment as it is, whether it is sung by one person or taken
from a song recording. You can look up the sung part on local music websites.
II. Transcription Rules​
1. Punctuation
a. Type: comma, full stop, question mark, exclamation mark, hyphen
b. Use English half-width punctuation.
c. Punctuation must be attached to the preceding word and followed by a space before
the next word. For example: Kamu menang, kamu kalah.
d. Wherever a sentence break is needed, such as a stop after finishing a sentence, there
must be punctuation.
2. Case sensitive
a. Don't capitalize the first word in the sentence.
b. Proper nouns such as names of people, places, books and movies are written in
lowercase. Special abbreviations follow custom and must be in uppercase, such as AC
(air conditioning), KTP (identity card), DPRD (Regional People's Representative Council).
3. When a hyphen is needed
When two words are joined to form a new word with a different meaning, use a hyphen.
Example: tiba (arrive), tiba-tiba (suddenly)
4. Figure/ Number
Generally speaking, there are two main ways of reading numbers. The first is as a
numerical value (123 as one hundred and twenty-three), and the second is as a numerical
code (123 as one, two, three).
a. Transcribe numbers directly as Indonesian number words: the numerical value 123 as
seratus dua puluh tiga, the numerical code 1234 as satu dua tiga empat. Separate each
word with a space.
b. If the speaker swallows part of an Indonesian numeral, do not restore the standard
form. For example, nam need not be transcribed as enam, and lapan need not be
transcribed as delapan.
c. Both "no" and "nomer" are acceptable when they express order (the first, the second, ...).
d. Decimal point: "," (comma; if read as koma, write koma)
Separator: "." (full stop; if read as titik, write titik)
5. For ID card numbers, telephone numbers and postcodes, transcribe what you hear. Put a
space between each spelled-out digit and letter, and write digits as Indonesian number
words. For example: sky123@gmail.com as s k y satu dua tiga et gmail dot com
6. Unit
For international units, write out the word according to the pronunciation, e.g.
kilo, kilometer
Don't use km or kg.
7. Gender Judgment
a. Type: male, female, child, unknown
b. If the speaker uses a voice changer, judge by what you hear in the audio clip: if the
changed voice sounds male, label it male; if it sounds female, label it female; if it is
difficult to distinguish, label it unknown.

8. Incomplete Word at the Beginning or at the End
◦ Incomplete LYRIC word at the beginning/end of the sentence: if you can work out the
whole word from a lyric reference on the web or in an app, pls transcribe the
complete word (you can lengthen the segment a little to contain the word); if you
cannot, pls discard the word directly.
◦ PLS TRY YOUR BEST TO GUESS AND TRANSCRIBE THE LYRIC WORD. Only if it is
really difficult to find the complete word may you discard it.
**BUT an incomplete speech word at the beginning or at the end must still be
discarded.
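
The number and address rules in items 4 and 5 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `read_as_value`, `read_as_code` and `spell_address` are hypothetical, the value reading covers 0 to 999999 only, and keeping the domain part of an address as whole words is inferred from the single example in item 5. The actual pronunciation in the audio always takes priority (for instance nam instead of enam).

```python
UNITS = ["nol", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "delapan", "sembilan"]

def read_as_value(n: int) -> str:
    """Spell n (0..999999) as an Indonesian numerical value."""
    if not 0 <= n <= 999_999:
        raise ValueError("this sketch only covers 0..999999")
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return UNITS[n - 10] + " belas"
    if n < 100:
        tens, rest = divmod(n, 10)
        head = UNITS[tens] + " puluh"
    elif n < 200:
        head, rest = "seratus", n - 100
    elif n < 1000:
        hundreds, rest = divmod(n, 100)
        head = UNITS[hundreds] + " ratus"
    elif n < 2000:
        head, rest = "seribu", n - 1000
    else:
        thousands, rest = divmod(n, 1000)
        head = read_as_value(thousands) + " ribu"
    return head if rest == 0 else head + " " + read_as_value(rest)

def read_as_code(digits: str) -> str:
    """Spell a digit string one digit at a time (numerical code)."""
    if not digits or any(d not in "0123456789" for d in digits):
        raise ValueError("expected a non-empty string of digits 0-9")
    return " ".join(UNITS[int(d)] for d in digits)

def spell_address(address: str) -> str:
    """Spell the part before '@' character by character.

    Assumption: the domain is kept as whole words joined by 'dot',
    as in the sky123@gmail.com example.
    """
    local, _, domain = address.partition("@")
    parts = [UNITS[int(ch)] if ch in "0123456789" else ch for ch in local]
    parts.append("et")
    parts.append(domain.replace(".", " dot "))
    return " ".join(parts)

print(read_as_value(123))
print(read_as_code("1234"))
print(spell_address("sky123@gmail.com"))
```

The output is a default reading only; if the speaker says something different, transcribe what is heard.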
III. FAQ​
1. Stammering words​
Stammering words should be labeled according to the actual pronunciation​
Example: se se sebenarnya ​
2. Slang/ Internet words
Common colloquial pronunciations that drop or add syllables can be transcribed directly
as pronounced.
Example: "nam/itam/ilang/mamae/mantul" need not be transcribed as
"enam/hitam/hilang/mama/mantap betul"
3. Abbreviation
a. If the full form of an abbreviation is pronounced, do not write the abbreviation.
Eg: "dan lain-lain" can't be abbreviated to dll
b. If the abbreviation itself is pronounced, transcribe the corresponding abbreviation
directly.
Eg: KTP, HP (phone), AC can't be written as katepe, hape, atse
c. Days of the week, months and addresses cannot be abbreviated.
d. Indonesian alphabet pronunciation reference:
A [a:] B [bé] C [tsé] D [dé] E [é] F [éf] G [gé] H [ha] I [i] J [jé] K [ka] L [él] M [ém] N [én]
O [o] P [pé] Q [ki] R [ér] S [és] T [té] U [u] V [fé] W [wé] X [éks] Y [yé] Z [zé]
4. A few idioms and phonetic translation
If the sound is the same, then both spellings are acceptable.
Eg: santai → santay, "i" turns into "y"
If the sound is different, transcribe as the original pronunciation.
Eg: sama-sama → sami-sami, "a" turns into "i"
5. Modal words
Modal words don't need special treatment; if they occur, transcribe them according to the
pronunciation, such as ih, nih, loh, tuh, eh, ya
Notice: sih, nih, loh, tuh, eh cannot be written as si, ni, lo, tu, e
6. Foreign loanword or brand name
Transcribe according to the actual pronunciation; there is no need to use the original
spelling.
Eg: xiaomi → xiomi
7. Don't write nonexistent words according to the pronunciation
After labeling a fragment, check that the transcribed text actually makes sense.
8. Prefixes and suffixes
Prefixes and suffixes need to follow the grammar.
Eg: "di" used with a verb forms the passive and is written joined to the verb, such as "digunakan"
"di" used with a place means "at" and is followed by a space, such as "di mana, di sekolah"
9. Overlap
Don't transcribe overlapping audio.
Overlap and Interrupt:
If two people are talking at the same time, this is considered "overlap"; do not label it.
However, if one person interrupts and only one voice can be clearly heard at any moment
(the other person does not speak at the same time), this is regarded as "interrupting"
and can be labeled normally.
10. Many people say the same thing in unison
Speech said in unison by many people can be labeled.
In this case, if the speakers are of the same gender, select that gender; if the genders
differ, select Unknown.
11. Incomplete word at the beginning/end of a sentence
Discard the word.
12. Laughter: real laughter
Real laughter does not need to be labeled. However, clearly deliberate laughter of 1 to 3
"ha" sounds, where the count can be determined by ear (ha, haha or hahaha), can be
written. Longer or uncountable laughter is cut out.
Eg1: 1-3, transcribe
A: video ini lucu bangat, hahaha, saya mau nonton lagi.
transcribe as: video ini lucu bangat, hahaha, saya mau nonton lagi.
Eg2: More than 3, cut off: video ini lucu bangat, hahahahaha, saya mau nonton lagi.
transcribe in two parts, cutting out the laughter
part1: video ini lucu bangat,
part2: saya mau nonton lagi.
13. Cannot hear clearly
If there are more than two inaudible words in a segment, discard it. If one word is
inaudible and that word seriously affects the meaning of the sentence, also discard it.
The following is a way to determine whether the word seriously affects the meaning:
a. If, after cutting out the word, the remaining two parts of the sentence still have
meaning, cut the word out and label the two remaining parts separately. However, both
parts must still meet the minimum segment length.
b. If, after cutting out the word, the rest of the sentence is meaningless, the word
seriously affects the meaning. Therefore, discard the whole segment.
14. Punctuation​
Punctuation must be added according to the meaning of the sentence.​
If there is a break in the paragraph, such as a CLEAR pause at the end of a sentence,
punctuation is needed.​
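
The laughter rule in item 12 can be sketched in Python as a check on a draft transcript. This is a minimal illustration, assuming the laughter has already been typed as a run of "ha" syllables; the function name `split_on_long_laughter` is hypothetical, and in practice the count is judged by ear from the audio.

```python
import re

# A run of four or more "ha" syllables, plus any punctuation mark and
# spaces that directly follow it.
LONG_LAUGH = re.compile(r"\b(?:ha){4,}\b[,.!?]?\s*", re.IGNORECASE)

def split_on_long_laughter(text: str) -> list[str]:
    """Cut out laughter longer than 3 'ha' and return the remaining parts.

    Laughter of 1 to 3 'ha' is left in place, so the text comes back
    as a single part.
    """
    parts = [part.strip() for part in LONG_LAUGH.split(text)]
    return [part for part in parts if part]

print(split_on_long_laughter(
    "video ini lucu bangat, hahahahaha, saya mau nonton lagi."))
print(split_on_long_laughter(
    "video ini lucu bangat, hahaha, saya mau nonton lagi."))
```

Each part produced by a cut must still satisfy the minimum segment length of 2 words; the sketch does not check that.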

IV. Valid data​


1. Ensure that each sentence is semantically complete; as far as
possible, do not split a complete sentence.
2. Segment length requirements
◦ 2.1 There is no limit on segment length now; you can make a segment as
long as you like (make sure of the quality)! But the segment must contain at
least 2 words (≥2 words)!! And pls also follow rule 1: try not to cut a
complete sentence.
Suggestion: try your best to cut segments of 6~8 seconds.
Long compound sentences can be cut at the commas.
◦ 2.2 DO NOT miss a word or letter in the transcription.
◦ 2.3 Special situations (such as continuous dialogue with simple responses) are not
bound to 6-8s. Verbal affirmations such as "ya, ya, mantap betul" or "nah itu, itu" are
cut and transcribed as normal short sentences.
◦ 2.4 If two or more people are talking, cut each person separately. If one of them speaks
for less than 6s but says 2 words or more, cut and transcribe normally. If one
person plays more than one role and carries on a dialogue, cut by role according
to the dialogue, whether or not the tone differs.
On this basis, there is a special case: when one party's turn is fewer than 2 words,
the speech of the two people can be put in one segment. In this case, if the
speakers are of the same gender, select that gender; if the genders differ, select
Unknown.
Eg 1:​
A: mau ke mal nggak?​
B: nggak mau, ​
A: siap​
transcribe as​
Part1: mau ke mal nggak? ​
Part2: nggak mau. siap.​
Eg 2:​
A: mau ke mal nggak?​
B: nggak.​
A: jadi mau ke mana?​
B: restoran.​
transcribe as​
Part1: mau ke mal nggak? nggak.​
Part2: jadi mau ke mana? restoran.​
3. The word at the end of one segment and the word at the
beginning of the next must remain intact; do not cut through
a phoneme.
4. A segment must not contain more than 2 continuous seconds of anything other than
obvious human voice or lyrics (silence included).
5. There can be only one main speaker within each segment. If there is background
sound and it is not obvious, ignore it and cut normally; if it is obvious, the data is invalid.
(Background sound: other people speaking/singing)
If the background sound or noise interferes heavily with the main speaker's voice, do
not label. If it does not affect the main speaker's voice, for example there is a human
voice that cannot be heard clearly, label as usual.

CAN cut and transcribe:​


If there is clear music or singing in the background but the main speaker's voice or
lyric can also be heard clearly, transcribe.
If there is loud noise in the background (like in the street or in a bar/club) but the main
speaker's voice or lyric can be heard clearly, transcribe.
If there are sound effects in the background (like laughter) but the main speaker's
voice or lyric can be heard clearly, transcribe.
If another person speaks in a very low voice or cannot be heard clearly while the main
speaker is speaking, transcribe what the main speaker says.

CANNOT transcribe:​
If the main speaker's voice is drowned out by the background sound or cannot be heard
clearly, invalid.
If two speakers speak together and both of them can be heard clearly, as in an
interruption or argument, invalid.
6. Leave 0.2-0.5 seconds of blank padding before and after each segment (padding means
any sound except obvious human voice, including silence).
7. Voice changers, TV broadcasts, navigation voices and similar sounds are cut and labeled normally.
8. Narration videos can be cut and transcribed normally.
9. Talk shows and live character performances (original): cut by character (character A,
character B, ..., the narrator). Each character's content must meet the minimum cut length
(≥2 words). If not, refer to "2.4" for handling.
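
The special case in 2.4 (a turn of fewer than 2 words is put in the same segment as the neighbouring turn, and the gender becomes Unknown when the speakers differ) can be sketched in Python as follows. This is a minimal illustration, not the platform's behaviour: the `Turn` class and `group_turns` function are hypothetical names, and merging a short turn into the preceding segment follows the pattern of Eg 1 and Eg 2.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    gender: str  # "male", "female", "child" or "unknown"
    text: str    # punctuated transcription of one speaker turn

def word_count(text: str) -> int:
    return len(text.split())

def group_turns(turns: list[Turn], min_words: int = 2) -> list[Turn]:
    """Merge turns shorter than min_words into the neighbouring segment."""
    segments: list[Turn] = []
    for turn in turns:
        # Merge when this turn is too short on its own, or when the
        # previous segment is still too short to stand alone.
        can_merge = bool(segments) and (
            word_count(turn.text) < min_words
            or word_count(segments[-1].text) < min_words
        )
        if can_merge:
            prev = segments[-1]
            gender = prev.gender if prev.gender == turn.gender else "unknown"
            segments[-1] = Turn(gender, prev.text + " " + turn.text)
        else:
            segments.append(Turn(turn.gender, turn.text))
    return segments

# Eg 2 from 2.4, assuming speaker A is female and speaker B is male.
dialogue = [
    Turn("female", "mau ke mal nggak?"),
    Turn("male", "nggak."),
    Turn("female", "jadi mau ke mana?"),
    Turn("male", "restoran."),
]
for segment in group_turns(dialogue):
    print(segment.gender, "|", segment.text)
```

The sketch only counts words; whether a turn should stand alone still depends on the audio (overlap, clarity, padding).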
V. Invalid data​
1. If the whole audio is an unintelligible dialect, contains no speech, or is one word
repeated continuously, such as "ya ya ya" and "hehehehehe", it is invalid.
2. If silence, severely slurred live singing or speech (including the words of infants who
have just learned to speak and cannot articulate clearly), heavy accent or blurred
pronunciation (you are not sure the words are written correctly), or lost frames make the
segment impossible to understand, it is invalid.
3. Reused audio or video (suspected of having been used many times) is invalid. Do not
label such cases, for example lip-synching or BGM song lyrics; choose "can not label".
4. Partly reused content:
For a co-shot video, do not label the part suspected of having been used many times.
The part with the creator's own real voice needs to be labeled properly.
5. If the content of a segment is fewer than two words (<2), the segment is invalid.
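
Some of the mechanical checks above can be sketched in Python as a quick pre-submission test on one segment. This is a rough illustration under stated assumptions: the function name `segment_problems` and its parameters are hypothetical, the "one word repeated" rule is written for whole audio in item 1 and is applied per segment here, and the padding limits come from IV.6. Judgments about clarity, overlap or reused audio cannot be automated this way.

```python
def segment_problems(text: str, lead_pad: float, trail_pad: float) -> list[str]:
    """Return a list of rule violations for one transcribed segment.

    lead_pad and trail_pad are the seconds of non-speech kept before
    and after the segment.
    """
    problems = []
    words = [w.strip(",.?!").lower() for w in text.split()]
    words = [w for w in words if w]
    if len(words) < 2:
        problems.append("fewer than 2 words")
    elif len(set(words)) == 1:
        problems.append("one word repeated")
    for name, pad in (("leading", lead_pad), ("trailing", trail_pad)):
        if not 0.2 <= pad <= 0.5:
            problems.append(f"{name} padding outside 0.2-0.5 s")
    return problems

print(segment_problems("ya ya ya", 0.3, 0.3))
print(segment_problems("kamu menang, kamu kalah.", 0.3, 0.3))
```

An empty list only means that these particular checks passed; the segment still has to be reviewed against the audio.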
