
Nimbus

Version 1.1

Contents

Introduction
Overview of Task
Step 1: Understand the Query
    Implicit Questions
    Spelling and/or Grammatical Errors
Step 2: Validate the Query
Step 3: Validate the Response
Step 4: Assess the Accuracy of the Response
Step 5: How Grounded is the Response
    Notes on Groundedness: Searching Verbatim Text?
    Notes on Groundedness: Main Claims and Partial Claims
Step 6: How Satisfying is the Response?
    Principles of Satisfaction for Responses
    Types of Responses
    Grading Scale for Responses
    Response Type Specific Guidance
Step 7: Assign Overall Preference Rating (where Applicable)
Examples
    Step 2: Query Validation
    Step 5: Groundedness
    Step 6: Response Satisfaction

Introduction
The purpose of this task is to evaluate the quality of our answers to users' questions. In this task, you'll be presented with:

• A user's query that likely asks a question,

• One response containing a few lines of text that answers the question, or two responses presented side by side.

In some cases you'll also see:

• One or more response source(s): the web page(s) that the response comes from. You can visit a source by clicking the superscript number or the URL below the response (see below for an example). Note that the same source number on different sides may refer to different sources.

Example Image of Side By Side Grading on Different Grading Platforms




Overview of Task
The task is to review the provided query, the response(s), and the source(s) (if present), and then answer a few questions.

The grading task consists of seven main steps:

1. Understand the Question. Using popular web search engines, make sure you understand what the question is asking and what it refers to.

2. Validate the Question. Check to make sure there's nothing about the question that makes it impossible or inappropriate to grade the answer. (If you decide the question is invalid, you will skip the remaining steps.)

3. Validate each Response. Before you can grade the satisfaction of the response, you'll indicate whether there are any problems that would prevent you from judging it.

4. Assess the factual accuracy of each response. Using web search engines and trustworthy sources, you'll determine whether the response is accurate.

5. Assess the groundedness of each response. All responses should have their facts grounded in cited sources. We do not want responses providing unsupported statements.

6. Assign a Satisfaction Grade to each response. Using the scale described below, you'll indicate how satisfying the response is given the user's question.

7. (For tasks with two responses) Indicate which response you'd prefer. Using the scale described below, you'll indicate which response is better given the user's question.

These steps are described in detail below.
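The control flow of the steps above can be sketched in code. This is a hypothetical illustration only; the function name, step labels, and argument names are placeholders, not part of any grading platform.

```python
# Hypothetical sketch of the grading flow described above; every name here
# is a placeholder, not part of any real grading tool.

def grading_steps(query_valid, num_responses):
    """Return the ordered list of steps a grader performs for one task."""
    steps = ["understand query", "validate query"]
    if not query_valid:
        # An invalid query ends the task: the remaining steps are skipped.
        return steps
    # Steps 3-6 are repeated for every response shown in the task.
    for _ in range(num_responses):
        steps += ["validate response", "assess accuracy",
                  "assess groundedness", "assign satisfaction grade"]
    if num_responses == 2:
        # Side-by-side tasks end with an overall preference rating (Step 7).
        steps.append("assign overall preference rating")
    return steps
```

For example, an invalid query stops after the first two steps, while a side-by-side task with a valid query runs Steps 3-6 twice before the preference rating.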

Common Grading Mistakes Page 3 of 30



Step 1: Understand the Query
The first step is to make sure you understand what the user is asking. If you don't understand what the query refers to, you won't be able to determine if the answer is correct. As a simple example, if the query was "who played m" and you didn't know that M was the name of a recurring character in the James Bond movie franchise, you might think that the question was nonsensical.

Also, many queries can refer to multiple things that have the same name. It's important to know which of these meanings is the most likely interpretation. For example, there are many people named Michael Jordan. If the question was "who is Michael Jordan?" the most likely interpretation is the former basketball star. But if the question was "who is Michael Jordan a professor?" the most likely interpretation is the machine learning expert.

Use the links to popular search engines to make sure you understand the primary interpretation of the query.

Implicit Questions
Note that even if the query isn't explicitly phrased as a question, there is often one or more obvious interpretations as a question, especially if the query is a person, place, or task. For example:

Query                        Implicit Interpretation
barack obama                 Who is Barack Obama?
conakry                      Where is Conakry?
change a tire                How do I change a tire?
convert 11 am pst to cet?    What is 11 am PST in CET?

Queries and their implicit interpretation


Spelling and/or Grammatical Errors


If a query contains spelling and/or grammatical errors, but it's clear what the user was trying to write, proceed as if the correct spelling and grammar were used.



Step 2: Validate the Query


A query must be valid for the corresponding answer to be assigned a satisfaction grade. In general, valid queries tend to be those that have a correct answer and can be answered with information available on public web pages. In Step 2, you'll mark all reasons why the query might not be valid, chosen from this list:

Category: Requires Personal Information
Description: Mark this choice if the question requires personal information about the user to know the right answer.
Examples:
• "what time does school start" requires knowing what school the user attends.
• "how can i turn off my phone" requires knowing the user's phone model.

Category: Topic is Apple Product Help
Description: Mark this choice if the question is explicitly seeking help about how to use an Apple product.
Examples:
• "change mac desktop picture"
• "how to turn off my iPhone"

Category: Query has No Clear Question
Description: Mark this choice if the query doesn't seem to be asking a question, even implicitly, or is incomplete, or is so garbled or full of misspellings and grammatical errors that you don't know what it's asking, or has empty content.
Examples:
• "Go Chargers"
• "lweriojsdf"
• "is biden veto"
• "how much does"

Category: Query has No Clear Answer
Description: The query is asking something whose factual correctness cannot be verified. For example, it's asking for a matter of opinion, or for speculation involving religious faith, or about fictional characters/events such as Santa Claus, the Easter Bunny, or the Tooth Fairy.
Examples:
• "which color is better for a car, blue or red?"
• "what happens to our soul after we die?"
• "where does the easter bunny live?"
• "when was world war iii?"

Category: Wrong Language
Description: Queries must be either in the locale language, or in English, or be a well-known entity such as a song or movie. If that is not the case, mark the query as Wrong Language.
Examples:
• Locale is en-US, query is the Japanese phrase 出る釘は打たれる. This is Wrong Language.
• Locale is de-DE, query is the French phrase "plus ça change, plus c'est la même chose". This is Wrong Language.
• Locale is ja-JP, query is "uncertainty principle". This is NOT Wrong Language, because it's English, even though it's not the locale language.
• Locale is en-US, query is the Spanish title "Despacito". This is NOT Wrong Language, because it's the name of a famous song that's well known in this locale.

Category: Inappropriate
Description: Mark the question as Inappropriate if it contains any of the following:
• Hate speech, discrimination, prejudice, or stereotypes about particular identities (race, religion, ethnicity, sexual orientation, etc.)
• Profanity (i.e., swear words)
• Sexually explicit or pornographic content
• Graphic descriptions of violence or death
• Content encouraging violence or illegal activity

Category: Unsafe
Description: A question is Unsafe if it is seeking advice on medical emergencies or mental health emergencies. (This does not include non-emergency/non-urgent-care medical questions such as "what are the symptoms of covid" or "what is aspirin used for".)

Category: Time Sensitive
Description: A question is Time-Sensitive if the correct answer changes more frequently than every 6 months (for example, daily, weekly, or monthly).
Examples:
• The answer to "how many days does this month have" depends on what month it is.
• "what is the horoscope for libra today"
• "days to thanksgiving"

Category: None of the above (query is valid)

Query Validation

⚠ Note that if the question is invalid (that is, if you select any choice(s) other than "None of the above"), you'll skip the remaining steps and will be done with this query and response.







Step 3: Validate the Response
In Step 3, you'll mark all reasons why the response might not be valid, chosen from the following list. If there is more than one response, you should examine each of them.

Rating: Wrong Language
Description: A response is in the wrong language if all of the following statements are true:
• The response is NOT in English.
• The response is NOT in the language of the user's locale.
• The query and response are NOT in the same language.

Rating: Inappropriate
Description: A response is inappropriate if it contains any of the following:
• Hate speech, discrimination, derision, prejudice, or stereotypes about particular identities (race, religion, ethnicity, sexual orientation, etc.)
• Profanity (i.e., swear words)
• Sexually explicit or pornographic content
• Graphic descriptions of violent, physically dangerous, or illegal activities
• Content that leads to self-harm, violence, or illegal activities

Response Validation
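The Wrong Language rule for responses is a conjunction of three conditions: all three must hold at once. A minimal sketch, assuming responses and queries can be tagged with language codes (the function name and codes are hypothetical, not part of the guidelines):

```python
# Hypothetical sketch of the Wrong Language rule above: a response is in the
# wrong language only when ALL three conditions hold. Language codes are
# illustrative placeholders.

def response_wrong_language(response_lang, locale_lang, query_lang):
    return (response_lang != "en"             # not in English
            and response_lang != locale_lang  # not in the user's locale language
            and response_lang != query_lang)  # not in the query's language
```

So a French response to a German query in a de-DE locale would be Wrong Language, while an English response never is, and a response that matches either the locale or the query language is also fine.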

Step 4: Assess the Accuracy of the Response


An accurate response correctly interprets and answers the question as verified by multiple trustworthy sites. Accuracy is one of the research-intensive parts of this grading process.

There are different types of accuracy: the question might ask a precise question with a single verifiable answer (e.g., "Who won the Soccer World Cup 2022 Finals?"). However, sometimes the question might have had different answers at different points in time (e.g., "Who was the winner of the Oscars"), and the answer would need to take into account the most obvious interpretation (latest year) or qualify its answer.

At times it is very difficult to rate the accuracy of some responses because of a lack of verifiable sources (e.g., questions about games, or responses found on user-generated sites such as Quora and Reddit).

⚠ Medical Accuracy
Medical accuracy should be judged using trusted health websites such as MayoClinic, NHS, NIH, Medline, CDC, and Merck Manuals. This and this website have links to trusted health sites.

⚠ Small Disagreements
For queries asking for a number, a date, or a time period whose exact value is hard to verify, different sources may have some small disagreement. An accurate response to these queries should be in rough agreement across multiple trustworthy websites. Examples:
• duration of wars in history
• "how old is the oldest tree?"
• "what is the population of the Los Angeles metropolitan area?"

Rating: Accurate
Description: The response correctly and completely answers the dominant interpretation of the query as verified by multiple trustworthy sites.
Examples:
• Factual questions related to entities, such as the age of someone or who did something, should be based on the most recent data.
• For example, the accurate answer to "Who owns range rover" is "Tata Motors", since they are the current owners.

Rating: Partially Accurate
Description: Choose this when the response is accurate but has one or more of the following problems:
• The interpretation of the query is not what most users are looking for, but the response is accurate.
• The answer is incomplete (such as a list missing important items), but the provided information is accurate.
Examples:
• The answer "In 2008, Jaguar owned land rover" to the query "Who owns range rover" is Partially Accurate, since the response correctly answers a niche interpretation and qualifies the answer with the correct year.
• Lists that contain some but not all items (maybe leaving out the non-primary items, or containing most but not all relevant entries) are Partially Accurate. For example, the response "Brendan Fraser, Austin Butler, Colin Farrell" to the query "Oscar Best Actor Nomination" is Partially Accurate, since it leaves out two other nominees.

Rating: Not Accurate
Description: The answer is wrong (or has the wrong interpretation).
Examples:
• If the question is "who still alive from the golden girls" and the answer is "Of all the cast members, Betty White is the only one alive. She portrays Rose Nylund, a Norwegian American from Minnesota", then the answer is Not Accurate, since Betty White died in 2021.
• The response "Range Rover is owned by Jaguar" to the query "Who owns range rover" is Not Accurate, since the response is not currently correct.

Rating: Difficult to Verify
Description: Choose this very rarely, and only when, with research, you find it's extremely difficult to verify the accuracy because:
• the response comes from user-reported data, such as anecdotal or user-generated content, which is difficult to verify; or
• there are multiple answers from different sources that conflict with each other.
Examples:
• If the query is "how many cloves of garlic" and the response is "Most garlic bulbs you'll find at the grocery store contain 10-12 cloves.", this is rated Difficult to Verify, as this number is difficult to verify (and it depends on the size of the bulb).
• If the question is "Are all US Navy SEALs considered elite or only SEAL Team Six?", then the answer from Quora (see top right) is difficult to verify or to ascertain accuracy (it's a generic question and statement).

Accuracy Ratings

Step 5: How Grounded is the Response
A response typically contains claims. For example, a response to "how many sides does an isosceles triangle have?" could be:

    "An isosceles triangle has three sides; two equal sides and one unequal side ¹"
    [1] https://www.omnicalculator.com/math/triangle-height#:~:text=An%20isosceles%20triangle%20is%20a%20triangle%20with%20two%20sides%20of%20equal%20length.

This response makes two claims:

• an isosceles triangle has three sides; and
• an isosceles triangle has two equal sides and one unequal side.

In this case only the last claim is supported by the cited source: there is nothing in the source that says isosceles triangles have three sides.

You'll be asked to assess the groundedness of the claims.

Groundedness
A claim is grounded if all the information it conveys is supported by the cited source. This is not the same as accuracy: claims can be grounded but inaccurate, or accurate but ungrounded.

You should check the referenced sources by clicking the superscript numbered URL links, and choose from the following options.

Rating: Fully Grounded
Description: All claims are cited and can be found in their respective citations.

Rating: Partially Grounded
Description: Some claims do not have sources or cannot be found in any of the sources.

Rating: Not Grounded
Description: None of the claims can be found in any of the referenced sources. Choose this if the source is not reachable (e.g., the webpage doesn't exist or redirects to an unrelated page).

Groundedness Ratings
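One way to read the three groundedness ratings is as a roll-up over per-claim support: all claims supported, some supported, or none supported. A minimal sketch, assuming each claim has already been judged supported or not (the function name and boolean encoding are assumptions, not part of the guidelines):

```python
# Hypothetical roll-up of per-claim support into the overall groundedness
# rating described above. Each entry is True when that claim is found in a
# cited, reachable source.

def groundedness_rating(claims_supported):
    if not claims_supported:
        raise ValueError("a response must contain at least one claim")
    if all(claims_supported):
        return "Fully Grounded"
    if any(claims_supported):
        return "Partially Grounded"
    return "Not Grounded"
```

For the isosceles-triangle example in Step 5, the claim list would be [False, True] (only the second claim appears in the source), which rolls up to Partially Grounded.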

Notes on Groundedness: Searching Verbatim Text?
The claim can exist in the source but not literally as written in the response. Moreover, you may need to expand sections in a webpage in order to find the supporting evidence (for example, you may need to click "Entry, Exit and Visa Requirements" in this source to see the content). For example:

• For "what are the different generations of mustangs?", the answer contains "The 7th generation is available as a coupe and convertible with powertrains ranging from a turbo-four to a V-8", taken from the source. The actual text, however, is "The Mustang will enter its seventh generation". Searching for the exact text ("7th generation") will not yield a match.

• Another example is the answer to "is a passport needed to travel to bahamas?", which is "A passport is required to travel to the Bahamas.¹ All Americans, including children and infants, must have a valid passport when traveling by air.² Even if flying through the US, a passport is still mandatory.³"

  • The first citation implicitly says a passport is required, though the text is not literally there (hence this claim is Grounded).
  • The last citation is a 404 and hence Not Grounded.
  • Taken together, this response is Partially Grounded.

Notes on Groundedness: Main Claims and Partial Claims


Answers often have a main claim that directly answers the question, and then minor/supporting claims that add context or additional helpful information. We provide guidance in the Satisfaction rating section on how to incorporate this.

Step 6: How Satisfying is the Response?
In this step, you'll focus on the response and determine how well it satisfies the user's information need. If there is more than one response, you should assess each of them.

Principles of Satisfaction for Responses


Use the following principles while grading for Satisfaction.

Relatedness: Is the information exactly what the user was asking about, or is it answering a slightly different question?

Correctness: Is the information provided in the response correct as of the current date?

Completeness: Does the information provided in the response answer the user's question completely, or only partially?

Context: Does the response contain enough context to make the answer as useful as possible?

Conciseness: Can the user easily identify the useful information in the response? We wish to provide our users with answers that address the question without superfluous wording. Responses should be concise: they should not include extraneous text that is not needed to answer the question (for example, repeating text, restating phrases already mentioned in the text, or adding extra facts that do not provide additional insight).

Readability: Does the response read like an answer to the question? Is the information presented in a way that most users can understand it? Is the response grammatically correct? Do the multiple sentences in the response logically fit together to form a coherent answer?

Grounding: (If any reference is present) Is a statement in the response supported by the reference, or is it irrelevant to, or even contradicting, what is cited? Check the cited page (if present) and see how grounded the response is.

Principles of Satisfaction

Use all of the above to help you choose your rating. For example, consider the question "Who played James Bond in Casino Royale?" The answer "David Niven" is technically correct, related, and understandable, but it is incomplete and missing important context: Niven played Bond in a 1967 version of Casino Royale that was a parody of spy movies, not the better-known 2006 film starring Daniel Craig. While "Daniel Craig" would be a more satisfying answer, better still would be an answer that mentioned both actors and explained which films they were in.

Types of Responses
• Textual: a passage of text that answers the question. Sometimes these will be associated with a short answer, which is a smaller snippet from the passage.

• List: a list of answers, such as the symptoms of strep throat. Please scan the list to find the answer to the question (note: the correct answer may not be present).

List Answer

• Table: a tabular representation that contains the answer. Please scan the table for the answer (if it is present).

Table answer


Grading Scale for Responses
You'll use the following scale to grade the response, with some specific guidance for different response types:
• Highly Satisfying. Almost all users would want to see this response.
• Satisfying. Many users would be interested in seeing this response.
• Somewhat Satisfying. Some users may find this response useful, but it's probably not what most searchers were looking for.
• Not Satisfying. The response provides no value to the user and should not be shown. You should rate a response Not Satisfying if it is Inappropriate.

Now let's look at more details about when to choose each grade. In each case, you'll choose that grade if any of the situations listed apply. The guidance on when to choose ratings applies to all response types; where applicable, we have called out response-specific guidance.

Rating: Highly Satisfying
Description:
• Is accurate, comprehensive, up-to-date, and addresses the most likely user need(s).
• Gives the correct answer clearly and concisely.
• Presents information in a grammatically correct, easily understandable, and logically coherent way.
• (If references are present) The response is Fully Grounded: all the information in the response is supported by the sources it cites.
Notes:
• Do not rate answers Highly Satisfying simply because they are correct and long.
• When the question can have multiple interpretations without a dominant one, do not choose Highly Satisfying, since no single answer will satisfy everyone.
Response-Specific Guidance (in addition to the above):
• For list-based responses, the list completely and directly answers the user's query.
• For table-based responses, the table contains a cell that completely and directly answers the user's query and is easy to discover.


Rating: Satisfying
Description:
• Leaves out some details or useful context.
• Includes a lot of other information that makes it harder to find the answer.
• Contains some grammatical, spelling, or formatting errors, but is overall understandable to most users.
• Is more simplified or more complicated (and often more wordy) than it needs to be.
• Answers a slightly more general question.
• Uses strongly opinionated language that sounds like an advertisement.
Notes:
• (If references are present) A minor/supporting claim (i.e., a claim that is not essential to answering the question) is ungrounded, but still accurate. The main claim, however, needs to be grounded. For example, consider this response to "is water colorless":

  "Water has a slight blue color that becomes deeper as the thickness of the observed sample increases[1]. This is because red, orange and yellow (long wavelength light) are absorbed more strongly by water than is blue (short wavelength light)[2]. When white light from the sun enters the ocean, it is mostly the blue that gets returned[3], which is why the ocean appears blue."

  The question is directly answered in the first claim. If the last claim lacked a citation or could not be found in [3], this is a Satisfying response.
Response-Specific Guidance (in addition to the above):
• For list-based responses, the answer is not as direct as possible, but you can quickly find the answer within the list, and other entries in the list may add useful context to the answer.
• For table-based responses, no cell from the table can directly answer the question, but there is a cell which might answer a simplified or generalized version of the query.

Rating: Somewhat Satisfying
Description:
• Is inaccurate on some details.
• Answers a different question, based on a less common interpretation of the query.
• Provides only a partial answer to the question.
• Is technically correct, but misleading.
• Is poorly written, verbose, and/or unclear. This includes the case where sentences in the response do not fit together in a logical, coherent way.
• (If references are present) Several minor/supporting claims (i.e., claims that are not essential to answering the question) are ungrounded (they either lack a citation or cannot be found in the citation), but they are still accurate. The main claim needs to be grounded.
Response-Specific Guidance (in addition to the above):
• For list-based responses, the list contains the correct answer, but it is unnecessarily long, such that it may take you several seconds to find the correct answer within the list.
• For table-based responses, there is a cell which might partially answer the query, but it is clearly incomplete, unclear, or very redundant.

Rating: Not Satisfying
Description:
• Has nothing to do with the query (e.g., answers a different question).
• Fails to answer the question.
• Provides incorrect information (including information that was correct at some time in the past but is no longer correct today).
• Is completely not understandable (e.g., gibberish).
• Is inappropriate, as you indicated in the previous question.
• The response is Not Grounded. For example, the answer might be accurate, but none of the claims could be found in the cited sources.
• The response is Partially Grounded, but the most important claim in the answer (i.e., the part of the answer that most directly answers the question) is not grounded in the cited sources.
• Some of the claims in the response are not supported by the sources they cite, and these claims are inaccurate or contradict common consensus.

Satisfaction Ratings

Response Type Specific Guidance


For table responses, we try to place a Table Caption and a Table Context Heading. Both are there to provide context for understanding the table (though you should always click on the Response Source to get the actual context for the table).
• The satisfaction rating is not affected if either is missing.
• If either the Table Caption or the Table Context Heading is irrelevant or gibberish, demote the rating one level (e.g., Satisfying → Somewhat Satisfying).
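The one-level demotion above can be sketched as an index shift on the four-point scale. This is a hypothetical helper; the guidelines do not define behavior below Not Satisfying, so clamping at the bottom of the scale is an assumption.

```python
# Hypothetical sketch of the one-level demotion rule above. The scale order
# comes from the grading scale; clamping at "Not Satisfying" is an assumption.

SCALE = ["Highly Satisfying", "Satisfying", "Somewhat Satisfying", "Not Satisfying"]

def demote(rating):
    i = SCALE.index(rating)
    return SCALE[min(i + 1, len(SCALE) - 1)]
```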


Step 7: Assign Overall Preference Rating (where Applicable)


If two responses are present, you'll be asked to give an Overall Preference Rating (OPR), which indicates which response you prefer. Choose the OPR considering the satisfaction principles:
• Relatedness: Prefer the response that is more related to what the query is asking.
• Correctness: Prefer the response that is more factually accurate. For queries without a single correct answer, prefer the response that is more satisfying to the majority of users.
• Completeness, Context & Conciseness: Prefer the response that provides more of the useful information needed to answer the question, with less extraneous wording or unnecessary information.
• Readability: Prefer the response that is more understandable, more pleasing to read, and makes more logical sense.
• Grounding: Prefer the response that is better supported by the cited sources.

Examples

Step 2: Query Validation

Examples for Query Validation

Query: iphone screen shot
Response: press the side button and volume up button at the same time
Validation: Apple Help Topic
Comments: This is an Apple Help question. Also, note that this answer is only correct for certain models of iPhone.

Query: best quiche recipe
Response: The first quiche to come to the attention of the American public was the quiche Lorraine in the 1950s...
Validation: No Clear Answer
Comments: There is no clear answer to the question, since the "best" recipe is a matter of opinion.

Query: who was voted off Survivor Island last week
Validation: Time-Sensitive
Comments: The events of a weekly TV show change every week.

Query: jules verne novel about submarine [asked in en-US]
Response: Vingt mille lieues sous les mers
Validation: Wrong Language
Comments: The response is not in English, and is not a well-known title in this locale. (In English-speaking countries, the book is known as Twenty Thousand Leagues Under the Sea.)

Query: das boot [asked in en-US]
Response: Das Boot is a 1981 West German war film written and directed by Wolfgang Petersen...
Validation: None
Comments: Even though the query is in German, it's the name of a famous movie that is best known by its German title even in the US, so this is not Wrong Language. Also note that most of the answer is in English, so the response is clearly aimed at English speakers. This is also an example of an implicit question, "What is das boot?"

Query: west side story download
Response: Click On "Download" Button And Follow Steps To Download And Watch Movies Online For Free
Validation: Inappropriate
Comments: The result is encouraging illegal activity, namely, pirating a copyrighted movie.

Query: fentanyl overdose
Response: An opioid overdose can be serious and life-threatening.
Validation: Unsafe
Comments: The question involves a medical emergency.

Query: what are the baltic states
Response: The Baltic States is an unofficial geopolitical term for three countries: Estonia, Latvia, and Lithuania.
Validation: None
Comments: Even though the answer is a list of three countries, it's the answer to a single question.

Step 5: Groundedness

Examples of Groundedness
Response
Query Response Groundednes Comments
s
Confederate soldiers are buried in various locations
where are the
including Arlington National Cemetery, [1] All the listed burial locations are mentioned in the cited sources, making the response Fully
confederate Fully Grounded
Lewisburg, West Virginia, [2] Oak Hill [3] and Oak Grounded.
soldiers buried?
Woods Cemetery in Chicago [4].
Query: are emily blunt and john krasinski married?
Response: Yes, Emily Blunt and John Krasinski have been married since 2010. [1] They got married in an intimate ceremony at George Clooney's house in Lake Como, Italy on July 10, 2010 [2] and have two daughters together, Hazel and Violet. [3]
Groundedness: Fully Grounded
Comments: All the claims are mentioned in the cited sources.

Query: is a passport needed to travel to bahamas?
Response: U.S. citizens are generally required to present a valid U.S. passport when traveling to The Bahamas, as well as proof of anticipated departure from The Bahamas. [1]
Groundedness: Fully Grounded
Comments: The information in the claim is present in the cited source. Note that you have to click and expand the "Entry, Exit, and Visa Requirements" section to find it.

Query: the kid who would be king cast morgana?
Response: Rebecca Ferguson plays Morgana in the movie The Kid Who Would Be King. [1]
Groundedness: Fully Grounded
Comments: The information in the claim is present in the cited source. Note that you have to combine multiple sentences to see it, e.g. "The film is written and directed by Joe Cornish and stars ... with Rebecca Ferguson and Patrick Stewart" and "...together with the legendary wizard Merlin (Stewart), take on the wicked enchantress Morgana (Ferguson)."

Query: how many awards has morgan freeman won?
Response: Morgan Freeman has won an Academy Award, a Golden Globe Award, and a Screen Actors Guild Award. [1] He has also been nominated for 5 Academy Awards and 5 Golden Globe Awards. [2] In total, he has been nominated 79 times and has won 61 awards. [3]
Groundedness: Partially Grounded
Comments: The 1st and 3rd sentences are supported by their respective sources, while the 2nd one is not. This response would also be Partially Accurate (since Morgan Freeman has won 2 Golden Globe Awards). Since this is an issue with the main claim, this would be Somewhat Satisfying.

Query: what is the difference between the united states and the united kingdom?
Response: The United States of America declared its independence from the British Empire in 1776, and the 50 stars on its flag represent the 50 U.S. states. The United Kingdom is a sovereign state made up of four countries: England, Scotland, Wales and Northern Ireland, analogous to the USA and its states. [1]
Groundedness: Partially Grounded
Comments: The answer is partially accurate, but only the first part of it is found in the source; the source does not say what the UK is (and moreover the UK is not analogous to the USA and its states, as the response claims). This answer is Somewhat Satisfying because it is partially accurate and part of the main answer partially addresses the question.

Common Grading Mistakes Page 23 of 30
Query: is a passport needed to travel to bahamas?
Response: A passport is required to travel to the Bahamas. [1] All Americans, including children and infants, must have a valid passport when traveling by air. [2] Even if flying through the US, a passport is still mandatory. [3]
Groundedness: Partially Grounded
Comments: The answer is provided in the first claim and is supported by the citation (though not literally). The last claim cites a page that returns a 404 error, but it is a minor claim. Because the 404 affects only a minor claim, this response would be Satisfying.

Query: 1 pint is equal to oz?
Response: One pint is equivalent to 16 ounces. [1]
Groundedness: Not Grounded
Comments: The source is an unrelated website.

Query: 12pm cst is what time pst?
Response: 12pm CST is 4pm PST. [1]
Groundedness: Not Grounded
Comments: The claim is not supported by the source. Instead, the source suggests that 12pm CST is 10am PST.

Query: what is the acceptance rate at high point university?
Response: High Point University is a private liberal arts college with an acceptance rate of 77%. [1] For the academic year 2020-21, the acceptance rate is 77.05% and the yield is 15.98%. [2]
Groundedness: Not Grounded
Comments: The first citation doesn't load and the second source doesn't match the answer. This answer is inaccurate and not grounded. It would be Not Satisfying.
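Groundedness comments sometimes hinge on facts graders can verify on their own, such as the time-zone claim in the 12pm CST example above. This is a quick sketch of one way to check it with Python's standard zoneinfo module (the specific date is an arbitrary choice; any winter date, when both zones observe standard time, gives the same result):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Noon US Central time on an arbitrary winter date (CST, UTC-6).
noon_cst = datetime(2022, 1, 15, 12, 0, tzinfo=ZoneInfo("America/Chicago"))

# Convert to US Pacific time (PST, UTC-8 in winter).
in_pst = noon_cst.astimezone(ZoneInfo("America/Los_Angeles"))
print(in_pst.strftime("%I:%M %p"))  # 10:00 AM, matching the correct source's claim
```

In summer the same conversion holds (CDT to PDT), since both zones shift by one hour together.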

Step 6: Response Satisfaction

Satisfaction Ratings (double click on images to view them)

Query: what is the difference between the uk great britain and the british isles?
Response: The UK is a sovereign state made up of England, Scotland, Wales, and Northern Ireland. Great Britain is an island comprising of England, Scotland and Wales, and the British Isles is a group of over 6,000 islands, of which Great Britain is the largest. [1]
Response Satisfaction: Highly Satisfying
Comments: The response completely and correctly answers the question. The provided information is fully supported by the cited source.

Query: where did the word metaverse come from
Response: The term "metaverse" has its origins in the 1992 science fiction novel Snow Crash as a portmanteau of "meta" and "universe."
Response Satisfaction: Highly Satisfying
Comments: The response is not perfectly comprehensive (for instance, it doesn't mention the author of the novel), but it does answer the question, and is clearly a result almost all users would want to see.

Query: how many planets are in our solar system
Response: There are eight planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.
Response Satisfaction: Highly Satisfying
Comments: The response provides useful context (the names of the planets) to completely answer the question. It is also up-to-date (not including Pluto).

Query: who played james bond in casino royale
Response: Daniel Craig starred as Bond in Casino Royale (2006), while David Niven played Bond in the 1967 film of the same name.
Response Satisfaction: Highly Satisfying
Comments: Answers the question completely and provides meaningful context.

Query: rationalism
Response: Rationalism is the philosophical view that knowledge is acquired through reason, without the aid of the senses.
Response Satisfaction: Highly Satisfying
Comments: Clear answer, explained as simply as possible without being oversimplified.

Query: jeff bezos helicopter crash
Response: Jeff Bezos was reportedly involved in a fatal helicopter crash that gave him minor head lacerations in the year 2003.
Response Satisfaction: Highly Satisfying
Comments: Provides a direct, complete, accurate, and concise answer.
Query: what are some signs of strep throat
Response: Signs and symptoms of strep throat can include:
• Throat pain that usually comes on quickly
• Painful swallowing
• Red and swollen tonsils, sometimes with white patches or streaks of pus
• Tiny red spots on the area at the back of the roof of the mouth (soft or hard palate)
• Swollen, tender lymph nodes in your neck
• Fever
• Headache
• Rash
• Nausea or vomiting, especially in younger children
• Body aches
Response Satisfaction: Highly Satisfying
Comments: Provides a direct, complete, accurate and concise answer. Every item in the list contains highly relevant information.

Query: what's the tallest building in new york city
Response: [table image]
Response Satisfaction: Highly Satisfying
Comments: The cell "One World Trade Center" provides a direct, complete, concise answer. The information in the table is accurate and highly relevant.

Query: what's the average cost for invisalign
Response: [table image]
Response Satisfaction: Highly Satisfying
Comments: The cell "$3,000–$5,000" under Invisalign provides a direct, complete, and concise answer. The table itself is highly relevant.

Query: how the titanic sink
Response: [table image]
Response Satisfaction: Highly Satisfying
Comments: The response contains "Collision with iceberg on 14 April" under "Cause", which answers the query correctly and concisely.

Query: who voices pidge in voltron?
Response: Pidge in Voltron is voiced by Bex Taylor-Klaus. [1] She has also voiced Taylor in Hell Fest.
Response Satisfaction: Satisfying
Comments: The response correctly answers the question. The 2nd claim, despite being accurate, does not have a citation. Given that this is a minor claim, the response is Satisfying.

Query: is a passport needed to travel to bahamas?
Response: A passport is required to travel to the Bahamas. [1] All Americans, including children and infants, must have a valid passport when traveling by air. [2] Even if flying through the US, a passport is still mandatory. [3]
Response Satisfaction: Satisfying
Comments: The answer is accurately provided in the first claim and is supported by the citation (though not literally). The last claim cites a page that returns a 404 error, but it is a minor claim. Because the 404 affects only a minor claim, this response would be Satisfying.
Query: who played james bond in casino royale
Response: Daniel Craig
Response Satisfaction: Satisfying
Comments: Gives the answer the user was probably looking for, but omits useful context, since David Niven also played Bond in the 1967 film of the same name.

Query: how many planets are in our solar system
Response: Most large objects in orbit around the Sun lie near the plane of Earth's orbit, known as the ecliptic. All eight planets are very close to the ecliptic, whereas comets and Kuiper belt objects are frequently at significantly greater angles to it.
Response Satisfaction: Satisfying
Comments: Contains the correct answer, but the user has to read through a bunch of other information to find it.

Query: beagle litter size
Response: The average litter for a domestic dog consists of about six puppies
Response Satisfaction: Satisfying
Comments: Answers a slightly more general question (for dogs rather than beagles), but the answer is still valid.

Query: is falafel vegetarian
Response: Falafel are vegan, nutritious and super delicious.
Response Satisfaction: Satisfying
Comments: Accurately describes falafel as vegan but written in an informal, opinionated way.

Query: rationalism
Response: In philosophy, rationalism is the epistemological view that "regards reason as the chief source and test of knowledge."
Response Satisfaction: Satisfying
Comments: Answers the question in a generally understandable way but uses specialized terms like "epistemological" that many users would not be familiar with.

Query: which states capital is salem
Response: Salem Salem, capital of Oregon, U.S., and the seat (1849) of Marion county.
Response Satisfaction: Satisfying
Comments: Provides an accurate answer to the question. However, it is not a complete sentence, and the word "Salem" appears twice at the beginning of the response.

Query: what area code is 228
Response: The state of Mississippi is served by the following area codes:
• 228, which serves the Gulf Coast region of Mississippi
• 601/769, which serve most of southern Mississippi including Jackson
• 662, which serves northern Mississippi
Response Satisfaction: Satisfying
Comments: Completely answers the question and is accurate, but only the first entry of the list answers the question.

Query: who is the voice of oscar proud from the proud family
Response: [table image]
Response Satisfaction: Satisfying
Comments: The cell under "Voices of" contains the correct answer "Tommy Davidson", but it answers the more generic query "who are the voices of the proud family" instead of the specific question about Oscar Proud.

Query: what is peoa ut zip code?
Response: Peoa, Utah has a ZIP code of 84061. It is located in northern Utah and covers a slightly less than average land area compared to other ZIP codes in the United States. The sales tax rate of 7.15% applies in the zip code 84061. [1]
Response Satisfaction: Somewhat Satisfying
Comments: The essential claim to answer the question is accurate and grounded. However, the remaining two minor claims, despite being correct, are not grounded by any cited source. The response also provides extraneous information that is not necessary to answer the question.
Query: obamacare
Response: The Affordable Care Act, colloquially known as Obamacare, is a landmark United States federal statute...
Response Satisfaction: Somewhat Satisfying
Comments: The query should be interpreted as asking the implicit question "what is obamacare." The response tells us it's a U.S. law, but it doesn't say what that law does.

Query: where did the word metaverse come from
Response: Neil Stephenson's science fiction novel Snow Crash
Response Satisfaction: Somewhat Satisfying
Comments: Answers the question completely, but misspells the author's first name.

Query: where did the word metaverse come from
Response: Neal Stephenson
Response Satisfaction: Somewhat Satisfying
Comments: Answers a slightly different question (who came up with the word).

Query: who played james bond in casino royale
Response: David Niven
Response Satisfaction: Somewhat Satisfying
Comments: Technically correct, but misleading.

Query: rationalism
Response: The dispute between rationalism and empiricism concerns the extent to which we are dependent upon experience in our effort to gain knowledge of the external world.
Response Satisfaction: Somewhat Satisfying
Comments: Doesn't explicitly answer the question "what is rationalism?" but provides helpful context for understanding the concept.

Query: michael jordan age [asked in 2022]
Response: Actor Michael B. Jordan is 35 years old.
Response Satisfaction: Somewhat Satisfying
Comments: Some users might be interested in this answer, which is for a different interpretation of the query, but the question was probably asking about the former basketball star, not the actor.

Query: how old is donald trump [asked in 2022]
Response: Donald Trump was 70 years old when he was inaugurated as president in January 2017.
Response Satisfaction: Somewhat Satisfying
Comments: Although the response is technically correct, a more likely intent is to know Donald Trump's age in 2022, which requires some calculation given the response.

Query: why does pain hurts
Response: This theory states that pain is a function of the balance between the information traveling into the spinal cord through large nerve fibers and information traveling into the spinal cord through small nerve fibers. Remember, large nerve fibers carry non-nociceptive information and small nerve fibers carry nociceptive information. If the relative amount of activity is greater in large nerve fibers, there should be little or no pain.
Response Satisfaction: Somewhat Satisfying
Comments: The response provides a very scientific answer to the question, which may be too advanced and technical for users. As well, the response begins with "This theory," but the theory hasn't actually been named or defined yet.
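The Donald Trump example above notes that the likely intent is his age in the year the query was asked. The "some calculation" the comment refers to is ordinary birthday arithmetic, sketched below (his date of birth, June 14, 1946, is a matter of public record; the exact day within 2022 is an assumption):

```python
from datetime import date

BORN = date(1946, 6, 14)   # Donald Trump's date of birth
asked = date(2022, 7, 1)   # assumed: query asked mid-2022, after his birthday

# Subtract the birth year, then subtract one more year if the
# birthday has not yet occurred in the query year.
age = asked.year - BORN.year - ((asked.month, asked.day) < (BORN.month, BORN.day))
print(age)  # 76
```

For a query asked before June 14, 2022, the same formula would give 75, which is why the comment says the response "requires some calculation" rather than being outright wrong.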

Query: how many pounds makes a ton
Response: One tonne is equivalent to:
• In kilograms: 1000 kilograms (kg) by definition.
• In grams: 1000000 grams (g) or 1 megagram (Mg).
• In pounds: Exactly 1000/0.453
• In short tons: Exactly 1/0.907 184 74 short tons (ST), or approximately 1.102311311 ST.
• In long tons: Exactly 1/1.016 046 9088 long tons (LT), or approximately 0.9842065276 LT.
Response Satisfaction: Somewhat Satisfying
Comments: Although this response answers the query completely, it takes some time to find the correct entry within the list and do the necessary math to arrive at the answer.

Query: what causes memory loss
Response: [table image]
Response Satisfaction: Somewhat Satisfying
Comments: The response contains some information on the signs of dementia. Although it has some useful information on memory loss, it mainly focuses on the symptoms, which is not helpful for a query asking about the cause.

Query: elements in the nitrogen family
Response: [table image]
Response Satisfaction: Somewhat Satisfying
Comments: No single cell in the table is complete enough to answer the question, although the table itself is accurate and relevant.

Query: how many 1/2 cups in a gallon?
Response: There are 16 1/2 cups in a gallon. [1] 1 Cup is equal to 1/16 of a gallon, so 16 1/2 cups is equal to 1 gallon.
Response Satisfaction: Not Satisfying
Comments: The response is inaccurate, and the main claim "there are 16 1/2 cups in a gallon" is clearly not supported by the source.

Query: what is the acceptance rate at high point university?
Response: High Point University is a private liberal arts college with an acceptance rate of 77%. [1] For the academic year 2020-21, the acceptance rate is 77.05% and the yield is 15.98%. [2]
Response Satisfaction: Not Satisfying
Comments: The first citation doesn't load and the second source doesn't match the answer. This answer is inaccurate and not grounded. It would be Not Satisfying.

Query: how many planets are in our solar system
Response: Jupiter is the largest planet in the solar system.
Response Satisfaction: Not Satisfying
Comments: Answers a different question.

Query: how many planets are in our solar system
Response: nine
Response Satisfaction: Not Satisfying
Comments: Incorrect answer. Pluto hasn't been considered a planet since 2006, so the correct answer is eight.
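The arithmetic behind the "how many pounds makes a ton" and "how many 1/2 cups in a gallon?" examples above is easy to check. A short sketch (the kilogram-per-pound constant is the exact international definition of the avoirdupois pound, and "ton" here means the metric tonne, as in the response):

```python
KG_PER_POUND = 0.45359237            # exact, by international definition
pounds_per_tonne = 1000 / KG_PER_POUND
print(round(pounds_per_tonne, 2))    # 2204.62 pounds in one metric tonne

CUPS_PER_GALLON = 16                 # US customary: 1 gallon = 16 cups
half_cups_per_gallon = CUPS_PER_GALLON * 2   # two 1/2 cups per cup
print(half_cups_per_gallon)          # 32, not the "16 1/2 cups" claimed above
```

This is the kind of quick verification graders can use when a comment asserts that a numeric claim is inaccurate.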

Query: how old is donald trump [asked in 2022]
Response: Donald Trump is now 70 years old.
Response Satisfaction: Not Satisfying
Comments: The response is outdated by several years.

Query: where do they sell elf on the shelf
Response: As a toy, the Elf on the Shelf is benign enough. It's a skinny-ass doll, about a foot long, with a big-eyed pixie face, a plastic head, and a felt body, on sale at your local big box store for $29.95.
Response Satisfaction: Not Satisfying
Comments: The response uses inappropriate language ("skinny-ass doll"), which automatically makes it Not Satisfying.

Query: what is the formula for volume of a cylinder
Response: Having radius r and altitude (height) h, the surface area of a right circular cylinder, oriented so that its axis is vertical, consists of three parts:
• the area of the top base: πr²
• the area of the bottom base: πr²
• the area of the side: 2πrh
Response Satisfaction: Not Satisfying
Comments: The response doesn't answer the question.

Query: how much caffeine is in an iced cappuccino
Response: [table image]
Response Satisfaction: Not Satisfying
Comments: The response table doesn't contain information on the caffeine amount, and no cell can answer the question.

Query: what are the names of cinderella stepsisters
Response: [table image]
Response Satisfaction: Not Satisfying
Comments: The response is about the show "Sister of Cinderella" while the question is about the original story. The table does not answer the question.

Query: how much planets are there
Response: [table image]
Response Satisfaction: Not Satisfying
Comments: The response does not contain any cell with the actual number 8, which is the answer to the question. It is also inaccurate, since the Sun and Pluto are not planets.
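For reference on the cylinder example above: the formula the query was actually asking for is the volume V = πr²h, whereas the response gave surface-area components instead. A minimal sketch (the radius and height values are arbitrary):

```python
import math

def cylinder_volume(radius: float, height: float) -> float:
    # Volume of a right circular cylinder: base area (pi * r^2) times height.
    return math.pi * radius ** 2 * height

print(round(cylinder_volume(2.0, 5.0), 2))  # pi * 4 * 5, about 62.83
```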
