Data Correction Tasks


Data Validation (Per Speaker):

The purpose of this phase is to verify:

● Speaker dialect: it should be Saudi. Please also help identify the specific Saudi dialect.
● Speaker pronunciation.
● Tempo: assess the speaker's pace and determine whether it is within the normal range or unusually fast or slow.

The labeler will be provided with a few extended samples (10-20 seconds) or a YouTube link to verify the speaker's voice.

Transcript Correction (Per Sample):

Given an audio sample and its corresponding transcription, correct any transcription mistakes as accurately as possible.

● Add punctuation where appropriate. The marks {, ! ? .} are enough for this phase.
● The comma is written like this (،) and is also called (الشولة). It marks a short pause when reading and is used in the following places:
○ Between sentences that are connected in meaning, e.g.: الطالب المجتهد يهتم بدروسه، ويحرص على الالتزام.
○ Between the parts of a single thing, e.g.: من شهور السنة الميلادية: يناير، فبراير، مارس.
○ After the word naming the person addressed, e.g.: يا محمد، انتبه إلى دروسك.
○ Between an oath and its complement, e.g.: والله، لأتصدقن.
○ Between a conditional clause and its result, e.g.: إن سمعت النداء، أجب.
● Punctuation rules may not always apply as speakers might not follow them precisely.
Punctuation should reflect the speaker's natural rhythm and tone.
● Commas can be added for pauses within sentences or to indicate new ideas.
● Question marks should be added for interrogative sentences.
● Exclamation marks can be added to convey strong emotions or forceful utterances.
● Numbers should be normalized: whenever a number occurs, write it out as words in Arabic characters. Example:
○ 12: إتناشر
Foreign Words
For foreign words:
● If the word can be written in Arabic characters, write it that way.
● If there is no way to write the foreign word in Arabic, ignore the sample.
● Don't add {} or []; the transcript cannot contain any English characters.
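The character-level rules of this phase (numbers written out as words, no English characters, no {} or []) lend themselves to an automatic pre-check before a corrected transcript is submitted. The following is only a minimal sketch: the function name `transcript_issues` and the issue labels are hypothetical, and treating Arabic-Indic digits (such as ١٢) as unnormalized numbers is an assumption, not something the guidelines state.

```python
import re

# Assumption: ASCII digits and Arabic-Indic / Extended Arabic-Indic digits
# all count as numbers that still need to be written out as words.
DIGITS = re.compile(r"[0-9\u0660-\u0669\u06F0-\u06F9]")
# English (Latin) letters are not allowed anywhere in the transcript.
LATIN_LETTERS = re.compile(r"[A-Za-z]")
# Curly braces and square brackets must not be added.
BRACKETS = re.compile(r"[{}\[\]]")


def transcript_issues(transcript: str) -> list[str]:
    """Return labels for the character-level rules the transcript breaks."""
    issues = []
    if DIGITS.search(transcript):
        issues.append("digits")
    if LATIN_LETTERS.search(transcript):
        issues.append("latin_letters")
    if BRACKETS.search(transcript):
        issues.append("brackets")
    return issues


print(transcript_issues("عندي 12 كتاب"))      # ['digits']
print(transcript_issues("عندي إتناشر كتاب"))  # []
```

A check like this can only flag forbidden characters. Whether a foreign word can be written in Arabic, or whether the sample should be ignored, still needs the labeler's judgment.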

Bad Sample Removal (Per Sample):

During the process, identify and remove:

● Samples with different speakers compared to the majority of samples.
● Samples with multiple speakers (overlapping or two individuals speaking).
● Samples with excessive noise.
● Samples with notably inferior quality compared to others.
● Samples with numerous pronunciation errors.
● Samples with excessive "Aaaa.." or "ummmm.." sounds.
● Samples with slang, dialectal variations, or non-standard grammar that are notably different from the majority of samples.

General Feedback (Per Speaker):

Upon completion, give general feedback on the process using the form below, rating each of the following on a scale of 0-10:

● Speaker's voice.
● Speaker's style and emotional expression during speech.
● Quality of transcripts before correction (number of mistakes, overall accuracy).
● Variation in keywords usage (repetitive keywords).
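Before submitting the form, the four ratings can be sanity-checked against the 0-10 scale. This is a minimal sketch only: the field names in `CRITERIA` and the function `validate_ratings` are hypothetical and do not come from the actual feedback form.

```python
# Hypothetical field names for the four rated criteria listed above.
CRITERIA = (
    "voice",
    "style_and_emotion",
    "transcript_quality",
    "keyword_variation",
)


def validate_ratings(ratings: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the ratings are usable."""
    problems = []
    for name in CRITERIA:
        if name not in ratings:
            problems.append(f"missing: {name}")
        elif not 0 <= ratings[name] <= 10:
            problems.append(f"out of range: {name}")
    return problems


print(validate_ratings({"voice": 8, "style_and_emotion": 11}))
# ['out of range: style_and_emotion', 'missing: transcript_quality', 'missing: keyword_variation']
```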
Report
Upon completion, please fill in the “Additional feedback on the data.” section of the provided form with a short report describing the process and the frequent errors you encountered while correcting the transcripts. This will help us improve our methods and engines.

Feedback form

Please fill in this form with your labeling process feedback and report:

Feedback Form

Some Illustrative Examples

Voice: sample_link
Transcript: تعالي كلي المهم انا جيت وراحة دخلت المطبخ
Correction: تعالي كلي, المهم أنا جيت وراها و دخلت المطبخ

● Here there are three mistakes. The first is “وراحة”, corrected to “وراها”. The wrong and correct spellings are pronounced the same, but we want the transcript to use exactly the correct characters.

● The second mistake is the missing “و”. It is not obvious in the recording because the speaker said it quickly, but the sentence does not make sense without it, so you can tell a “و” is missing even if you can barely hear it.

● The third mistake is “انا”: the “ا” should be “أ”. Please keep this in mind; it matters because the two have totally different phonemes.

● You will also notice that I added a “,” because the speaker paused and changed the subject, and her tone changed as well, so a comma was warranted.

Voice: sample_link

Transcript: قلت انا لا يا عم الله يرضى عليك

Correction: قلتله لا يا عم الله يرضى عليك

● Here the ASR split “قلتله” into “قلت” and “انا”.
● The “:” was added because what follows is a quoted message.

Voice: sample_link

Transcript: تدرين اني ما كانت تقدر تروح تشتريه من السوق أبدا, تاخذ من بيت جدتي و تروح الخياطة تخليه يسويلنا بلايز طويلة وبنطلون وكان شكلنا عكس عند الناس, يا الله اني كنت أتفشل اني اطلع العب مع قرايبنا, وكان الكل يتهزفينا خاصة جو قرايبنا من المدن الكبيرة حتى اللي لابس و الله العظيم اني انقهر ينزلون شاطر يكونون مليانات الملابس عيوني بس علي اقول يا حظهم اهل شيرو لا خليني اقولك سياراتهم جديدة وكنت اقدر عنده شاشات سيارة اقول يا حظهم يا ليت زيهم و في بيت خالي هذا اللي اقول لكم دايما مهتم فينا يعطينا كذا بس ما شاء الله هو عنده عيال و بنات وش كثرهم عشان كذا ما كان يستقبلنا في بيته

Correction: تدرين إن ما كانت تقدر تروح تشتريلنا من السوق ابدا, تاخذ الأقمشة من بيت جدتي و تروح الخياطة تخليه يسويلنا بلايز طويلة وبنطلون وكان شكلنا عكس عند الناس. يا الله إني كنت تفشل اني اطلع العب مع قرايبنا, وكان الكل يتهزفينا خاصة إن جو قرايبنا من المدن الكبيرة حتى اللي لابس و الله العظيم اني انقهر ينزلون الشنط يكونون مليانات الملابس أنا عيوني بس عليه. أقول يا حظهم اهل يشترولهم و لا خليني اقولكم سياراتهم جديدة وكنت اقدر عندهم شاشات سيارة اقول يا حظهم يا ليت زيهم و في بيت خالي هذا اللي اقول لكم دايما مهتم فينا يعطينا كذا بس ما شاء الله هو عنده عيال و بنات وش كثرهم عشان كذا ما كان يستقبلنا في بيته

● This sample has many mistakes; I corrected only some of them, for illustration.
● This sample also has some pronunciation mistakes. A sample with numerous pronunciation errors can be neglected/removed.

Thank you!
