Version 1.1
Introduction
Overview of Task
Implicit Questions
Types of Responses
Step 5: Groundedness
• One response containing a few lines of text that answers the question, or two responses presented side-by-side.
• One or more response source(s): the web page(s) that the response comes from. You can visit a source by clicking the superscript number or the URL below the response (see below for an example). Note that the same source number in different sides may refer to different sources.
1. Understand the Question. Using popular web search engines, make sure you understand what the question is asking and what it refers to.
2. Validate the Question. Check to make sure there's nothing about the question that makes it impossible or inappropriate to grade the answer. (If you decide the question is invalid, you will skip the remaining steps.)
3. Validate each Response. Before you can grade the satisfaction of the Response, you'll indicate whether there are any problems that would prevent you from judging it.
4. Assess the factual accuracy of each Response. Using web search engines and trustworthy sources, you'll determine whether the Response is accurate.
5. Assess the groundedness of each Response. All responses should have their facts grounded in cited sources. We do not want responses providing unsupported statements.
6. Assign a Satisfaction Grade to each Response. Using the scale described below, you'll indicate how satisfying the response is given the user's question.
7. (For tasks with two Responses) Indicate which Response you'd prefer. Using the scale described below, you'll indicate which Response is better given the user's question.
Step 1: Understand the Query
The first step is to make sure you understand what the user is asking. If you don't understand what the query refers to, you won't be able to determine if the answer is correct. As a simple example, if the query was "who played m" and you didn't know that M is the name of a recurring character in the James Bond movie franchise, you might think that the question was nonsensical.
Also, many queries can refer to multiple things that have the same name. It's important to know which of these meanings is the most likely interpretation. For example, there are many people named Michael Jordan. If the question was "who is Michael Jordan?", the most likely interpretation is the former basketball star. But if the question was "who is Michael Jordan the professor?", the most likely interpretation is the machine learning expert.
Use the links to popular search engines to make sure you understand the primary interpretation of the query.
Implicit Questions
Note that even if the query isn't explicitly phrased as a question, there are often one or more obvious interpretations as a question, especially if the query names a person, place, or task. For example:
Topic is Apple Product Help: The query is seeking help about how to use an Apple product. Examples:
• "how to turn off my iPhone"

Query has No Clear Question: Mark this choice if the query doesn't seem to be asking a question, even implicitly, or is incomplete, or so garbled or full of misspellings and grammatical errors that you don't know what it's asking, or has empty content. Examples:
• "Go Chargers"
• "lweriojsdf,"
• "is biden veto"
• "how much does""
Wrong Language: Queries must be either in the locale language, or in English, or be a well-known entity such as a song or movie. If that is not the case, mark the query as Wrong Language. Examples:
▪ Locale is en-US, query is the Japanese phrase 出る釘は打たれる ("the nail that sticks out gets hammered down"): Wrong Language.
▪ Locale is de-DE, query is the French phrase "plus ça change, plus c'est la même chose" ("the more things change, the more they stay the same"): Wrong Language.
▪ Locale is ja-JP, query is "uncertainty principle": This is NOT Wrong Language, because it is English, even though it is not the locale language.
▪ Locale is en-US, query is the Spanish title "Despacito": This is NOT Wrong Language, because it is the name of a famous song that is well known in this locale.
Query Validation
⚠ Note that if the question is invalid (that is, if you select any choice(s) other than "None of the above"), you'll skip the remaining steps and will be done with this query and response.
Response Validation

Wrong Language: A response is in the wrong language if all of the following statements are true:
• The response is NOT in English.
• The response is NOT in the language of the user's locale.
• The query and response are NOT in the same language.
There are different types of accuracy, in that the question might ask a precise question with a single verifiable answer (e.g. "Who won the Soccer World Cup 2022 Finals?"). However, sometimes the question might have had different answers at different instances of time (e.g. "Who was the winner of the Oscars?"), and the answer would need to take into account the most obvious interpretation (the latest year) or qualify its answer.
At times it is very difficult to rate the accuracy of some responses because of a lack of verifiable sources (e.g. questions about games, or responses found on user-generated sites such as Quora and Reddit).
⚠ Medical Accuracy
Medical accuracy should be judged using trusted health websites such as MayoClinic, NHS, NIH, Medline, CDC, and the Merck Manuals. This and this website have links to trusted health sites.

⚠ Small Disagreements
For queries asking for a number, a date, or a time period whose exact value is hard to verify, different sources may have some small disagreement. An accurate response to these queries should be in rough agreement across multiple trustworthy websites. Examples:
• duration of wars in history
• "how old is the oldest tree?"
• "what is the population in the Los Angeles metropolitan area?"
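The "rough agreement" idea above can be sketched as a small check. This is a hypothetical illustration only, not part of any rating tool, and the 10% tolerance is an assumption; the guidelines themselves ask only for rough agreement across trustworthy sources.

```python
def roughly_agrees(response_value: float, source_values: list[float], tol: float = 0.10) -> bool:
    """Return True if the response value is within a relative tolerance
    of every value found on trustworthy sources (tol=0.10 means 10%)."""
    return all(abs(response_value - v) <= tol * v for v in source_values)

# e.g. an "oldest tree" answer of 4800 years vs. sources saying 4850 and 4790
# falls within 10% of both, so it would count as rough agreement.
```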
Accuracy Ratings

Accurate: The response correctly and completely answers the dominant interpretation of the query, as verified by multiple trustworthy sites. Examples:
• Factual questions related to entities, such as "age of someone" or "who did something", should be judged against the most recent data.
• For example, the accurate answer to "Who owns Range Rover" is Tata Motors, since they are the current owners.
Groundedness

A claim is grounded if all the information it conveys is supported by the cited source. This is not the same as accuracy: claims can be grounded but inaccurate, or accurate but ungrounded.
You should check the referenced sources by clicking the superscript numbered URL links, and choose from the following options.

Example:
• "an isosceles triangle has two equal sides and one unequal side. [1]" https://www.omnicalculator.com/math/triangle-height#:~:text=An%20isosceles%20triangle%20is%20a%20triangle%20with%20two%20sides%20of%20equal%20length.
In this case only the first claim is supported by the cited source; there is nothing in the source that says isosceles triangles have an unequal third side.
• "what are the different generations of mustangs?": the answer contains "The 7th generation is available as a coupe and convertible with powertrains ranging from a turbo-four to a V-8", taken from the source. The actual source text, however, is "The Mustang will enter its seventh generation". Searching for the exact text ("7th generation") will not yield a match.
• Another example is the answer to "is a passport needed to travel to bahamas?", which is "A passport is required to travel to the Bahamas.¹ All Americans, including children and infants, must have a valid passport when traveling by air.² Even if flying through the US, a passport is still mandatory.³"
  • The first citation implicitly says a passport is required, though the text is not literally there (hence this claim is Grounded).
  • The last citation is a 404 and hence Not Grounded.
  • Taken together, this response is Partially Grounded.
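The decision rule running through these examples can be sketched as a small helper. This is a hypothetical illustration, not part of the rating tool; it assumes the three options are Fully Grounded, Partially Grounded, and Not Grounded, as used in the examples, and that the rater has already judged each claim as main or minor and grounded or not.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    is_main: bool    # essential to answering the question?
    grounded: bool   # supported by its cited source?

def groundedness(claims: list[Claim]) -> str:
    """Sketch of the groundedness options implied by the examples above."""
    if all(c.grounded for c in claims):
        return "Fully Grounded"
    # An ungrounded main claim sinks the whole response.
    if any(c.is_main and not c.grounded for c in claims):
        return "Not Grounded"
    # Only minor/supporting claims are ungrounded (e.g. a 404 citation
    # on a supporting sentence, as in the Bahamas passport example).
    return "Partially Grounded"
```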
Step 6: How Satisfying is the Response?
In this step, you'll focus on the response and determine how well it satisfies the user's information need. If there is more than one response, you should assess each of them.

Principles of Satisfaction

Relatedness: Is the information exactly what the user was asking about, or is it answering a slightly different question?
Correctness: Is the information provided in the response correct as of the current date?
Completeness: Does the information provided in the response answer the user's question completely, or only partially?
Context: Does the response contain enough context to make the answer as useful as possible?
Conciseness: Can the user easily identify the useful information in the response? We wish to provide our users with answers that address the question without superfluous wording. Responses should be concise: they should not include extraneous text that is not needed to answer the question (for example, repeating text, restating phrases already mentioned, or adding extra facts that do not provide additional insight).
Readability: Does the response read like an answer to the question? Is the information presented in a way that most users can understand? Is the response grammatically correct? Do the sentences in the response logically fit together to form a coherent answer?
Grounding: (If any reference is present) Is each statement in the response supported by the reference, or is it irrelevant to, or even contradicting, what is cited? Check the cited page (if present) and see how grounded the response is.
Use all of the above to help you choose your rating. For example, consider the question "Who played James Bond in Casino Royale?" The answer "David Niven" is technically correct, related, and understandable, but it is incomplete and missing important context: Niven played Bond in a 1967 version of Casino Royale that was a parody of spy movies, not the better-known 2006 film starring Daniel Craig. While "Daniel Craig" would be a more satisfying answer, better still would be an answer that mentioned both actors and explained which films they were in.
Types of Responses
• Textual: a passage of text that answers the question. Sometimes these will be associated with a short answer, which is a smaller snippet from the passage.
• List: a list of answers, such as the symptoms of strep throat. Please scan the list to find the answer to the question (note: the correct answer may not be present).
• Table: a tabular representation that contains the answer. Please scan the table for the answer (if it is present).
Grading Scale for Responses
You'll use the following scale to grade the response, with some specific guidance for different response types:
• Highly Satisfying. Almost all users would want to see this response.
• Satisfying. Many users would be interested in seeing this response.
• Somewhat Satisfying. Some users may find this response useful, but it's probably not what most searchers were looking for.
• Not Satisfying. The response provides no value to the user and should not be shown. You should rate a response Not Satisfying if it is Inappropriate.
Now let's look at more details about when to choose each grade. In each case, you'll choose that grade if any of the situations listed apply. The guidance on when to choose ratings applies to all response types; where applicable, we have called out response-specific guidance.
Satisfying
Description:
• Leaves out some details or useful context.
• Includes a lot of other information that makes it harder to find the answer.
• Contains some grammatical, spelling, or formatting errors, but is overall understandable to most users.
• Is more simplified or more complicated (and often more wordy) than it needs to be.
• Answers a slightly more general question.
• Uses strongly opinionated language that sounds like an advertisement.
Notes: (If references are present) A minor/supporting claim (i.e. a claim that is not essential to answering the question) is ungrounded, but still accurate. The main claim, however, needs to be grounded. For example, in response to "is water colorless": "Water has a slight blue color that becomes deeper as the thickness of the observed sample increases[1] This is because red, orange and yellow (long wavelength light) are absorbed more strongly by water than is blue (short wavelength light)[2] When white light from the sun enters the ocean, it is mostly the blue that gets returned[3], which is why the ocean appears blue."
Response Specific Guidance (in addition to the above):
• For list-based responses, the answer is not as direct as possible, but you can quickly find the answer within the list, and other entries in the list may add useful context to the answer.
• For table-based responses, no cell from the table can directly answer the question, but there is a cell which might answer a simplified or generalized version of the query.
Not Satisfying
Description:
• Has nothing to do with the query (e.g. answers a different question).
Step 5: Groundedness
Examples of Groundedness

Query: where are the confederate soldiers buried?
Response: Confederate soldiers are buried in various locations including Arlington National Cemetery, [1] Lewisburg, West Virginia, [2] Oak Hill [3] and Oak Woods Cemetery in Chicago [4].
Groundedness: Fully Grounded
Comments: All the listed burial locations are mentioned in the cited sources, making the response Fully Grounded.

Query: are emily blunt and john krasinski married?
Response: Yes, Emily Blunt and John Krasinski have been married since 2010. [1] They got married in an intimate ceremony at George Clooney's house in Lake Como, Italy on July 10, 2010 [2] and have two daughters together, Hazel and Violet. [3]
Groundedness: Fully Grounded
Comments: All the claims are mentioned in the cited sources.
Query: is a passport needed to travel to bahamas?
Response: A passport is required to travel to the Bahamas.[1] All Americans, including children and infants, must have a valid passport when traveling by air.[2] Even if flying through the US, a passport is still mandatory.[3]
Groundedness: Partially Grounded
Comments: The answer is provided in the first claim and is supported by the citation (though not literally). The last citation is a 404, but it is a minor claim. Because this is a 404, this response would be Satisfying.

Query: 1 pint is equal to oz?
Response: One pint is equivalent to 16 ounces. [1]
Groundedness: Not Grounded
Comments: The source is an unrelated website.

Query: 12pm cst is what time pst?
Response: 12pm CST is 4pm PST. [1]
Groundedness: Not Grounded
Comments: The claim is not supported by the source. Instead, the source suggests that 12pm CST is 10am PST.
Query: where did the word metaverse come from
Response: The term "metaverse" has its origins in the 1992 science fiction novel Snow Crash as a portmanteau of "meta" and "universe."
Satisfaction: Highly Satisfying
Comments: The response is not perfectly comprehensive (for instance, it doesn't mention the author of the novel), but it does answer the question, and is clearly a result almost all users would want to see.

Query: how many planets are in our solar system
Response: There are eight planets: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.
Satisfaction: Highly Satisfying
Comments: The response provides useful context (the names of the planets) to completely answer the question. It is also up-to-date (not including Pluto).

Query: who played james bond in casino royale
Response: Daniel Craig starred as Bond in Casino Royale (2006), while David Niven played Bond in the 1967 film of the same name.
Satisfaction: Highly Satisfying
Comments: Answers the question completely and provides meaningful context.

Query: rationalism
Response: Rationalism is the philosophical view that knowledge is acquired through reason, without the aid of the senses.
Satisfaction: Highly Satisfying
Comments: Clear answer, explained as simply as possible without being oversimplified.

Query: jeff bezos helicopter crash
Response: Jeff Bezos was reportedly involved in a fatal helicopter crash that gave him minor head lacerations in the year 2003.
Satisfaction: Highly Satisfying
Comments: Provides a direct, complete, accurate, and concise answer.
Query: what are some signs of strep throat
Response: Signs and symptoms of strep throat can include:
• Throat pain that usually comes on quickly
• Painful swallowing
• Red and swollen tonsils, sometimes with white patches or streaks of pus
• Tiny red spots on the area at the back of the roof of the mouth (soft or hard palate)
• Swollen, tender lymph nodes in your neck
• Fever
• Headache
• Rash
• Nausea or vomiting, especially in younger children
• Body aches
Satisfaction: Highly Satisfying
Comments: Provides a direct, complete, accurate, and concise answer. Every item in the list contains highly relevant information.

Query: what's the tallest building in new york city
Response: (table)
Satisfaction: Highly Satisfying
Comments: The cell "One World Trade Center" provides a direct, complete, and concise answer. The information in the table is accurate and highly relevant.

Query: what's the average cost for invisalign
Response: (table)
Satisfaction: Highly Satisfying
Comments: The cell "$3,000‒$5,000" under Invisalign provides a direct, complete, and concise answer. The table itself is highly relevant.

Query: how the titanic sink
Response: (table)
Satisfaction: Highly Satisfying
Comments: The response contains "Collision with iceberg on 14 April" under "Cause", which answers the query correctly and concisely.
Query: who voices pidge in voltron?
Response: Pidge in Voltron is voiced by Bex Taylor-Klaus. [1] She has also voiced Taylor in Hell Fest.
Satisfaction: Satisfying
Comments: The response correctly answers the question. The second claim, despite being accurate, does not have a citation. Given that this is a minor claim, the response is Satisfying.

Query: is a passport needed to travel to bahamas?
Response: A passport is required to travel to the Bahamas.[1] All Americans, including children and infants, must have a valid passport when traveling by air.[2] Even if flying through the US, a passport is still mandatory.[3]
Satisfaction: Satisfying
Comments: The answer is accurately provided in the first claim and is supported by the citation (though not literally). The last citation is a 404, but it is a minor claim. Because this is a 404, this response would be Satisfying.

Query: how many planets are in our solar system
Response: Most large objects in orbit around the Sun lie near the plane of Earth's orbit, known as the ecliptic. All eight planets are very close to the ecliptic, whereas comets and Kuiper belt objects are frequently at significantly greater angles to it.
Satisfaction: Satisfying
Comments: Contains the correct answer, but the user has to read through a bunch of other information to find it.

Query: beagle litter size
Response: The average litter for a domestic dog consists of about six puppies
Satisfaction: Satisfying
Comments: Answers a slightly more general question (for dogs rather than beagles), but the answer is still valid.

Query: is falafel vegetarian
Response: Falafel are vegan, nutritious and super delicious.
Satisfaction: Satisfying
Comments: Accurately describes falafel as vegan, but written in an informal, opinionated way.

Query: rationalism
Response: In philosophy, rationalism is the epistemological view that "regards reason as the chief source and test of knowledge."
Satisfaction: Satisfying
Comments: Answers the question in a generally understandable way but uses specialized terms like "epistemological" that many users would not be familiar with.

Query: which states capital is salem
Response: Salem Salem , capital of Oregon, U.S., and the seat (1849) of Marion county.
Satisfaction: Satisfying
Comments: Provides an accurate answer to the question. However, it is not a complete sentence, and the word "Salem" appears twice at the beginning of the response.

Query: what area code is 228
Response: The state of Mississippi is served by the following area codes:
Satisfaction: Satisfying
Comments: Completely answers the question and is accurate, but only the first entry of the list answers the question.

Query: who is the voice of oscar proud from the proud family
Response: (table)
Satisfaction: Satisfying
Comments: The cell under "Voices of" contains the correct answer "Tommy Davidson", but it answers a more generic query ("who are the voices of the proud family") instead of the specific question about Oscar Proud.
Query: what is peoa ut zip code?
Response: Peoa, Utah has a ZIP code of 84061. It is located in northern Utah and covers a slightly less than average land area compared to other ZIP codes in the United States. The sales tax rate of 7.15% applies in the zip code 84061. [1]
Satisfaction: Somewhat Satisfying
Comments: The essential claim answering the question is accurate and grounded. However, the remaining two minor claims, despite being correct, are not grounded by any cited source. The response also provides extraneous information that is not necessary to answer the question.

Query: where did the word metaverse come from
Response: Neil Stephenson's science fiction novel Snow Crash
Satisfaction: Somewhat Satisfying
Comments: Answers the question completely, but misspells the author's first name.

Query: where did the word metaverse come from
Response: Neal Stephenson
Satisfaction: Somewhat Satisfying
Comments: Answers a slightly different question (who came up with the word).

Query: who played james bond in casino royale
Response: David Niven
Satisfaction: Somewhat Satisfying
Comments: Technically correct, but misleading.

Query: rationalism
Response: The dispute between rationalism and empiricism concerns the extent to which we are dependent upon experience in our effort to gain knowledge of the external world.
Satisfaction: Somewhat Satisfying
Comments: Doesn't explicitly answer the question "what is rationalism?" but provides helpful context for understanding the concept.

Query: michael jordan age [asked in 2022]
Response: Actor Michael B. Jordan is 35 years old.
Satisfaction: Somewhat Satisfying
Comments: Some users might be interested in this answer, which is for a different interpretation of the query, but the question was probably asking about the former basketball star, not the actor.

Query: how old is donald trump [asked in 2022]
Response: Donald Trump was 70 years old when he was inaugurated as president in January 2017.
Satisfaction: Somewhat Satisfying
Comments: Although the response is technically correct, a more likely intent is to know Donald Trump's age in 2022, which requires some calculation given the response.

Query: why does pain hurts
Response: This theory states that pain is a function of the balance between the information traveling into the spinal cord through large nerve fibers and information traveling into the spinal cord through small nerve fibers. Remember, large nerve fibers carry non-nociceptive information and small nerve fibers carry nociceptive information. If the relative amount of activity is greater in large nerve fibers, there should be little or no pain.
Satisfaction: Somewhat Satisfying
Comments: The response provides a very scientific answer to the question, which may be too advanced and technical for users. As well, the response begins with "This theory," but the theory hasn't actually been named or defined yet.
Query: elements in the nitrogen family
Response: (table)
Satisfaction: Somewhat Satisfying
Comments: No single cell in the table is complete enough to answer the question, although the table itself is accurate and relevant.

Query: how many 1/2 cups in a gallon?
Response: There are 16 1/2 cups in a gallon. [1] 1 Cup is equal to 1/16 of a gallon, so 16 1/2 cups is equal to 1 gallon.
Satisfaction: Not Satisfying
Comments: The response is inaccurate, and the main claim "there are 16 1/2 cups in a gallon" is clearly not supported by the source.

Query: what is the acceptance rate at high point university?
Response: High Point University is a private liberal arts college with an acceptance rate of 77%.[1] For the academic year 2020-21, the acceptance rate is 77.05% and the yield is 15.98%.[2]
Satisfaction: Not Satisfying
Comments: The first citation doesn't load and the second source doesn't match the answer. This answer is inaccurate and not grounded. It would be Not Satisfying.

Query: how many planets are in our solar system
Response: Jupiter is the largest planet in the solar system.
Satisfaction: Not Satisfying
Comments: Answers a different question.

Query: how many planets are in our solar system [asked in 2022]
Response: nine
Satisfaction: Not Satisfying
Comments: Incorrect answer. Pluto hasn't been considered a planet since 2006, so the correct answer is eight.

Query: where do they sell elf on the shelf
Response: As a toy, the Elf on the Shelf is benign enough. It's a skinny-ass doll, about a foot long, with a big-eyed pixie face, a plastic head, and a felt body, on sale at your local big box store for $29.95.
Satisfaction: Not Satisfying
Comments: The response uses inappropriate language ("skinny-ass doll"), which automatically makes it Not Satisfying.

Query: what is the formula for volume of a cylinder
Response: Having radius r and altitude (height) h, the surface area of a right circular cylinder, oriented so that its axis is vertical, consists of three parts:
• the area of the top base: πr²
Satisfaction: Not Satisfying
Comments: The response doesn't answer the question.

Query: how much caffeine is in an iced cappuccino
Response: (table)
Satisfaction: Not Satisfying
Comments: The response table doesn't contain information on caffeine amount, and no cell can answer the question.

Query: what are the names of cinderella stepsisters
Response: (table)
Satisfaction: Not Satisfying
Comments: The response is about the show "Sister of Cinderella", while the question is about the original story. The table does not answer the question.

Query: how much planets are there
Response: (table)
Satisfaction: Not Satisfying
Comments: The response does not contain any cell with the actual number 8, which is the answer to the question. It is also inaccurate, since the Sun and Pluto are not planets.