
SBS Judge Guidelines

INTRODUCTION
This task is designed to evaluate the quality of the text results returned by an ASR (Automatic Speech
Recognition) model, which converts audio into text. Please DO take the time to listen to the audio, then
evaluate the converted text results to see whether they accurately reflect the audio.

Judge Link
http://dl8u0otluguph.cloudfront.net/
1. Please Log in with the account and password provided by your manager.

2. Find your task list under “Special Task” → “SBS Tasks” → “Project” → (Project Name), then choose any task
in your list and click “Start”;

3. Then you may enter the main page to evaluate:


3-1 Play Audio [Hotkey: “SHIFT+SPACE” to play/pause]
Click to play the audio, listen to it carefully.
Please note, you must play the audio before doing any evaluation. If you do not play the audio, you will not
be allowed to enter your evaluation results.
3-2 Read the 2 Side-By-Side Results:
After listening to the audio, look at the 2 text results (side-by-side). Then determine which side is better. The
differences between the 2 sides are highlighted so that you can easily spot them. Please note the highlights only
show the differences between the 2 sides. They do not indicate whether the highlighted part is right or wrong. You
will need to determine which side is right or wrong based on the audio played.

6 Questions to Answer:
1. Fluency Question:
This question asks which side lets you grasp the meaning of the played audio more quickly.
The result does not have to match the audio word for word. The faster you can get the correct meaning of
the audio, the better.
In the example below, Result1 accurately reflects the audio word for word. But from the point of view of quickly getting the meaning of the audio (which is what this question asks), Result2 is better.

Audio I would like you know to choose team one ahm actually you know no team two

Result1 I would like, you know, to choose team one. Actually, you know, no, team two.

Result2 I would like to choose team two.

2. Readability Question:
This question asks which side makes the converted text easier to read.
In the example below, both results correctly reflect the audio, but Result1 is easier to read. Therefore, for this question, Result1 is better.

Audio The total amount is one thousand three hundred and thirty-two

Result1 The total amount is 1,332.

Result2 The total amount is one thousand three hundred and thirty-two.

3. Punctuation Question:
This question asks which side is better in terms of punctuation accuracy, for example whether the correct
punctuation is used and whether any punctuation is missing or extra.
4. Word Casing Question:
This question asks which side has better word casing, for example whether a word is capitalized when it
should be lower case, or lower case when it should be capitalized.
5. Word Accuracy Question:
This question asks which side is better in terms of word accuracy, for example whether words are spelled
correctly and whether any words are missing or extra.
6. Overall Question:
This question asks for your overall impression of the results. It is possible that one side is better on
Readability while the other side has fewer word errors. In this case, you need to determine which side gives
the better overall experience based on the scenario.
Rating Scale

The rating scale and the corresponding scores are as follows:


• Left Side Significantly Better (-100)
• Left Side Better (-50)
• About the Same (0)
• Right Side Better (50)
• Right Side Significantly Better (100)
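
For anyone who later processes exported ratings, the scale above can be written as a simple lookup. This is only an illustrative sketch: the names `RATING_SCORES`, `describe_score`, and `preferred_side` are hypothetical and are not part of the judging tool.

```python
# Hypothetical lookup for the five rating labels and their scores,
# copied from the rating scale above. Not part of the judging tool.
RATING_SCORES = {
    "Left Side Significantly Better": -100,
    "Left Side Better": -50,
    "About the Same": 0,
    "Right Side Better": 50,
    "Right Side Significantly Better": 100,
}


def describe_score(score: int) -> str:
    """Return the rating label for a score; raise ValueError if it is off the scale."""
    for label, value in RATING_SCORES.items():
        if value == score:
            return label
    raise ValueError(f"{score} is not a valid rating score")


def preferred_side(score: int) -> str:
    """Return which side a score favors: 'Left', 'Right', or 'Neither'."""
    describe_score(score)  # reject scores that are not on the scale
    if score < 0:
        return "Left"
    if score > 0:
        return "Right"
    return "Neither"
```

Negative scores favor the left side and positive scores favor the right side, so the sign alone tells you which side was preferred and the magnitude tells you by how much.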
RATING SCALE (SCORING)
When doing the rating, you first pick which side is better. Then, determine how much better the preferred side is.
Choosing the right score can be difficult, and it is OK to be a little subjective here. Sometimes ambiguous
results (converted texts from the audio) make it difficult to apply one specific rating. For these results we ask
you to use common sense and the intuition you develop over time to come to the best possible judgment.

The following describes how to determine how much better one side is.


Scores:
• Significantly Better:
The better side accurately reflects the audio and is in a good, readable display format, while the other
side has many errors and its meaning can barely be understood from reading the results (converted text).
• Better:
Both sides’ results (converted text) convey the correct meaning of the audio, but one side excels in one or
several aspects, such as (but not limited to):
o Has fewer word errors,
o Has a more readable display format (e.g., Date, Time, Numbers, etc.),
o Has more correct punctuation,
o Has more correct capitalization
It is totally fine to be subjective here, but please provide a short comment about why you found one
converted text better than the other.
• About the Same:
The difference between the 2 sides is not noticeable, or one side is stronger on some factors but lacks
others that the other side has.

Examples:
• Significantly Better:
Audio: Among the total of 980 people, 550 are female. It’s about 56%.
Results:
In this case, the left side is “Significantly Better” than the right side, because the left side not only had the correct
words but also showed the text in a readable format, while the right side had errors that changed the meaning of the
original audio.
• Better:
1. Audio: Among the total of 980 people, 550 are female. It’s about 56%.

Results:
In this case, the left side is “Better” than the right side, because the left side is more readable than the right
side, although both sides have the correct words of the audio.

2. Audio: I like you, Will. You are the second Will that I have met and liked within two days. Is there a sign in
that?

Results:
In this case, the left side is “Better” than the right side, because “Will” is a person’s name. The right side correctly
recognized it as a person’s name with an upper-case “W”, while the left side did not and used a lower-case “w”, although both
sides have the correct words of the audio.
3. Audio: Oh, say, that's different, observed Markham, altering his demeanor.
Results:
In this case, the left side is “Better” than the right side. The right side incorrectly recognized “Oh” as “I'll”. Also, the right
side missed all the punctuation.

• About the Same:

Audio: Among the total of 980 people, 550 are female. It’s about 56%.

Results:
In this case, both sides are missing one word. Although the missing word is different, the impact of the missing
word is about the same. Therefore, the rating is “About the Same”.

• About the Same: insertion error

If both sides output a lot of content that is not in the audio, which we call an “insertion error”, the rating is
“About the Same”, since both are bad. If only a little extra content is output, please ignore it, since it may be
caused by an alignment issue during segmentation. The judge should still rate based on the majority of the
content.

Stopping Criteria:
For round 1, we stop showing an utterance to you for judgment once consensus is reached for that utterance.
The consensus criterion is a matching score from 3 users. For example, if we receive a
judgment score of 50 from 3 users for an utterance, we stop collecting further judgments for that utterance.
Some utterances may have to be reworked after round 1’s quality check, based on the client’s specific requests.
After the client’s final acceptance, no further judgments are collected for the utterance.
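
The round 1 consensus rule can be sketched in a few lines. This is a minimal illustration, assuming that any 3 matching scores among all judgments received for an utterance count as consensus; the function name `consensus_score` and the constant `CONSENSUS_COUNT` are hypothetical and do not describe how the platform is actually implemented.

```python
from collections import Counter
from typing import List, Optional

# Number of matching scores needed, per the stopping criteria above.
CONSENSUS_COUNT = 3


def consensus_score(scores: List[int], needed: int = CONSENSUS_COUNT) -> Optional[int]:
    """Return the score that at least `needed` judges agree on, or None if there is none yet.

    Assumption: the matching scores may come from any judges, in any order.
    """
    if not scores:
        return None
    # most_common(1) gives the single most frequent score and its count.
    score, count = Counter(scores).most_common(1)[0]
    return score if count >= needed else None
```

For example, the scores 50, 0, 50, 50 reach consensus on 50, while 50, 0, -50 do not reach consensus, so that utterance would keep being shown to judges.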

Report Problem:
You may report any problems with a particular utterance to the PM.
Comments:

After making a judgment, please explain in the comments why you favored one side over the other. Your comments
can relate to an overall impression, one specific judging criterion, or multiple criteria. It's entirely up to you.
Please make your comments clear, concise, and free of unnecessary words.

No matter which language you are judging, ALL COMMENTS MUST BE WRITTEN IN ENGLISH

Your comments should include three primary pieces of information:

1. Reference Side: The side (Left or Right) you are referring to.
2. Factual Basis: The facts that shape your opinion about results. What specific differences in the
transcriptions shaped your decision? Reference the converted text from the audio.
3. Sentiment: Express how those differences impacted your understanding or preference.

Here are some examples of comments, with facts and sentiment highlighted:

• Although both sides have the correct text content from the audio, the Left side is showing the Date in Date format
(10/21/2022) while the Right side is showing the Date in words (October the twenty first two thousand and twenty two).
So, the Left side is easier to read.
• The Right side has fewer errors, so I can still get the meaning of the audio. The Left side has too many errors
(e.g., “that” should be “then”, a missing word “them”, a wrongly placed comma), and it’s hard for me to understand
what it is talking about.

Please keep both facts and sentiments intact and try not to write your comments in the form of keywords. Please remember, you're
communicating with real people. Preserve grammatical structures and logical coherence to ensure your comments are easily
understood.
The following lists some common issues with comments and suggested improvements.

Issue: Comments are too generic
• Example: Right side better reflects the audio
Suggested Improvement: Provide examples of how the right side is better (e.g., word correctness, punctuation, or display format), or of what the left side got wrong
• Example: Left side has reco errors
Suggested Improvement: List what the reco errors on the left side are
• Example: Left correctly recognizes all words
Suggested Improvement: List what the right side got wrong in recognizing the words
• Example: Left Reco is better but has errors
Suggested Improvement: List what errors the left side has (give examples)

Issue: Lack of examples
• Example: Right side is better than left side because it has all correct words, correct punctuations
Suggested Improvement: It would be better to also list examples of what the left side did wrong on words or punctuation
• Example: The Left Reco is better. The Left Reco matches the speech word by word
Suggested Improvement: It would be better to also list examples of what the right side did wrong
• Example: In this case, left side is "Significantly Better" when compared to right side as it is free of errors and is according to the speech
Suggested Improvement: It would be better to also list examples of what the right side did wrong
