
Potomac v2.4 Guidelines

Table of Contents

Purpose
Types of Content Comparisons
Glossary of Key Terms
Rating Process Overview
Disqualification Criteria
Individual Component Level Comparison
    Tips for Dealing with Component Based Labels
    Image to Image Comparison
    Common Cases for Image Comparisons
    Caption to Caption Comparison
    Common Cases for Captions
    URL to URL Comparison
    Common Cases for URLs
    Video to Video Comparison
    Common Cases for Videos
Cross Component Comparison
    Detailed Walkthrough of Example Job
    General Tips for Approaching Cross Component Jobs
    Common Cases for Cross Component Comparisons
Overall Holistic Comparison
    Rating Process
    How to Use Link to Fact Check
    Labels for Holistic Comparison
    Common Cases for Holistic Matching
    Tips for Holistic Matching
Examples

Purpose

In this project you will perform pairwise content matching. For each job, you will be
presented with two pieces of content. Review each content pair, think about the
main point of each example, and then tell us whether they match one another.

Your job is to tell us the true relationship between each Source Content and Match
Candidate. It is imperative that you follow the process closely so we can measure the
precision of our systems and ultimately improve them.

Types of Content Comparisons

Same Component Comparison: Jobs where both the Match Candidate and the Source
Content share the same matching components. For instance, a Match Candidate’s
image matches to a Source Content’s image.
• Image to Image Comparison
• Caption to Caption Comparison
• URL to URL Comparison
• Video to Video Comparison

Cross Component Comparison: Jobs where the matching components are different. For
instance, a Match Candidate’s overlaid text from an image matches a Source
Content’s caption.

Some examples include:


• Overlaid Text to Caption
• Video to Image

Glossary of Key Terms

Key Terms and Definitions

1. Job: The whole process of reviewing whether a Match Candidate and Source Content match.
2. Source Content: The piece of content that has been rated by our Third-Party Fact Checkers.
3. Match Candidate: The piece of content that our systems predicted to be a match to the Source Content.
4. Same Component Job: A job where the Match Candidate and the Source Content share the same matching components. For instance, a Match Candidate’s image matches a Source Content’s image.
5. Cross Component Job: A job where the matching components are different. For instance, a Match Candidate’s overlaid text from an image matches a Source Content’s caption.
6. Claim: A statement of fact that can be supported or contradicted (or somewhere in between).
7. Central Claim: A statement of fact that is important to the content’s main point or purpose.
8. Claim Under Review: The main claim that was investigated and rated by the fact checkers. This usually represents the “main point” of the Source Content.
9. Link to Fact Check: The link that serves as the source of truth for the Claim Under Review and the central claim in the Source Content.

Rating Process Overview

Step 1
Question: Is this Content Qualified for Review?
Determine if the job meets the criteria to be rated.
• If No, disqualify and move to next job
• If Yes, see below

Step 2
What Component(s) of the Match Candidate contains the match?
Here is where you fully evaluate the job. First look at all of the components in the
Match Candidate, then evaluate the Source Content. Are there any components from
the Match Candidate that match to the Source Content? If so, select all of these
components. For instance, if you see that the Match Candidate has an Image and
Overlaid text that matches to the Image and Overlaid text of the Source Content,
select Image and Overlaid Text.

Step 3
[Selected Component] COMPARISON: Compare the Match Candidate
[SELECTED COMPONENT] and Source Content: how would you describe the
relationship between the [Selected Component] and the Source Content?
*Note*: only the components that you selected in the previous question will appear in
the rest of the question flow in SRT. Following from the previous example, if you
selected Image and Overlaid text in Step 2, then only the Image and Overlaid Text
questions will appear. When focusing on the selected Match Candidate Component
and the matching Source Content Component, use the criteria for what’s considered
a match.

Step 4

Holistic Matching (or Overall Comparison)


• Read through the Claim Under Review
o If the claim is vague or unclear, click on the link and read the Headline
of the fact check article. The Headline is usually about the central
claim of the Source Content. If it is still unclear, skim through the
article until you have a good understanding of the fact checked claim.
• Make sure to take all components into account and look at the content
holistically (for example, take into account how the caption, the image, and
the overlaid text combined tell the intended meaning of the content).
• Determine if the claim made in the Match Candidate matches the main claim
being made in the Source Content.
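
The four questions are strictly ordered, and only the components you select in Step 2 receive comparison questions in Step 3. As a rough illustration of that flow, here is a minimal Python sketch; every name in it is hypothetical, and the stubbed helpers stand in for your own judgment at each step, not for any real SRT interface.

```python
# Hypothetical sketch of the rating flow; not actual SRT tooling.

def is_qualified(job: dict) -> bool:
    # Step 1 stand-in: apply the Disqualification Criteria (next section).
    return not job.get("disqualified", False)

def matching_components(job: dict) -> list:
    # Step 2 stand-in: Match Candidate components that match the Source
    # Content, e.g. ["Image", "Overlaid Text"].
    return job.get("matching_components", [])

def compare_component(job: dict, component: str) -> str:
    # Step 3 stand-in: "Near Duplicate", "Near Match", "Do Not Match",
    # or "Unsure".
    return job.get("component_labels", {}).get(component, "Unsure")

def holistic_comparison(job: dict) -> str:
    # Step 4 stand-in: "Agrees", "Does Not Agree", "Unrelated", or "Unsure".
    return job.get("overall", "Unsure")

def rate_job(job: dict):
    if not is_qualified(job):                                  # Step 1
        return "Disqualified"
    selected = matching_components(job)                        # Step 2
    labels = {c: compare_component(job, c) for c in selected}  # Step 3
    return labels, holistic_comparison(job)                    # Step 4
```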

Disqualification Criteria

If the job contains any of the following, then it should be disqualified:


• Missing Content
o The SRT asks for information on content that is not present in the result.
• More than one foreign-language word
o Proper nouns are not considered foreign language (example: Cabo)
• Videos containing audio or subtitles in a language that isn’t English
o Any foreign-language text should be disqualified (even if a
translation option is offered)
• Corrupted Content
o One or more of the URLs in the job cannot be accessed as intended:
they lead to an error page, log-in screen, paywall, or broken link (404,
expired domain, server error). Note this is different from ‘missing
content’.
• The Source Content is not related to the Claim Under Review
o Quickly read the Claim Under Review, click the link, and skim the
headline and the first couple of sentences of the article. If
the Source Content is not about the same subject, the job does not qualify.

Once you have determined that the job meets the criteria to be rated (SRT Question
1), you will then:
• Determine how well the content matches on a component level (SRT
Question 2)
• Determine how well the content agrees holistically (SRT Question 3).

Think of these tasks as narrowly defined, standalone content-matching questions.


After the individual component-level review is complete, you will look at the bigger
picture, consider the examples holistically (or Overall Comparison), and tell us if they
share the same meaning overall.

Individual Component Level Comparison

The instructions for component-level matching are different for each content type
(image, caption, URL and video). Read through the instructions below and use the
SRT’s tooltips while rating, to refresh your memory when needed.

NOTE: As stated above, different components can match one another (e.g., a caption
matches overlaid text). This will be explained after the general Content <> Content
Comparison sections (see Cross Component Comparison below).

Tips for Dealing with Component Based Labels

Generally, the labels are as follows:


• Near Duplicates: There are either no differences, or if there are differences,
they do not change or have the potential to change the meaning of the
component.
• Near Matches: The differences are greater than the criteria for the "Near
Duplicates" label, but the content still conveys the same overall message.
• Do Not Match: The Match Candidate and Source Content are either about
different subject matters (and therefore they’re unrelated) or they are about
the same subject matter but clearly do not share the same meaning.
• Unsure: If you are unsure whether something should be labeled as Near
Match or Do Not Match, then choose Unsure.
NOTE: Some criteria are repeated throughout these guidelines. For instance, Trivial
Differences in Text appear in all of the content types. For the most part, these are the
exact same criteria with slight additions for photos and videos.

Image to Image Comparison

Once you have determined the job meets the criteria to be rated, you will be asked
to determine how well the images match on an individual component level.

Note: Do not consider the overall meaning of the Source or Match Candidate when
completing these subtasks. Think of these tasks as narrowly defined, standalone
content-matching questions. After the component-level review is complete, you will
look at the bigger picture, consider the examples holistically, and tell us if they share
the same meaning overall.

Decide if the Images match

Difference between Text caption and Overlaid text

When you are evaluating this part of the question, it is important to understand the
difference between the text captions and text overlay. NOTE: these two are
considered DIFFERENT components in SRT. This means that you will evaluate
them separately in the SRT question flow.
1. Text Caption – Any text that is added to enhance the post. This can be placed
above or below the image and is not part of the image.
2. Text Overlay – Any text placed within the image box, that acts as part of the
image.

Tip: If you hover over the image, the entire image will be outlined in blue. By doing
this, you can tell if text is a part of the image (overlaid text) or the post (caption).

Based on your visual assessment of the images, determine the correct label from the
list below:

Near Duplicates: These are identical or almost identical with the following trivial
differences:
• cropping, tint, color, brightness
• screenshots
• rotations
• stretching
• padding
• pixelation
• Trivial Imagery: added or subtracted imagery that is trivial in amount,
such as watermarks, arrows, or circles.
NOTE: These differences do not change the meaning of the component.

Near Match: The Match Candidate and Source Content are a near match when:
• The differences are greater than the criteria for Near Duplicates, but the
components share a similar message.
They Do Not Match: The Match Candidate and Source Content do not match when:

• They don’t refer to the same subject matter
• They make different claims
Unsure: If you are unsure whether something should be labeled as Near Match or
Do Not Match, then choose Unsure.
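
You will judge these differences visually, but for intuition about why trivial edits (tint, a slight crop, a watermark) still leave two images near duplicates, note that a perceptual hash barely moves under such changes. Below is a minimal illustrative sketch using the third-party Pillow and ImageHash libraries; the threshold of 8 bits is an arbitrary illustrative choice, not project policy.

```python
# Illustration only: perceptual hashing tolerates trivial image edits.
# Requires: pip install Pillow ImageHash
from PIL import Image
import imagehash

def looks_like_near_duplicate(path_a: str, path_b: str,
                              max_distance: int = 8) -> bool:
    # Average-hash both images; the Hamming distance between the hashes
    # stays small under cropping, tinting, or a small watermark.
    hash_a = imagehash.average_hash(Image.open(path_a))
    hash_b = imagehash.average_hash(Image.open(path_b))
    return hash_a - hash_b <= max_distance
```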

Common Cases for Image Comparisons


1. Photo within Albums: For an image to image comparison, an album (multiple
photos) matches to a single photo
a. How to handle: Focus on the most relevant image and use the criteria
for image to image matches.
2. Link to Image: This is when a link appears instead of a photo, but by looking
at the questions given in SRT, it’s a photo comparison
a. How to handle: If the link takes you only to an image, use the criteria
for labeling the image component.
i. If the photos are an exact match when you paste the link
into the browser, they are considered Near Duplicates.

Caption to Caption Comparison

Once you have determined the job meets the criteria to be rated, you will be asked
to determine how well the captions match on an individual component level.

Remember: Use the same criteria here when evaluating overlaid text.

Decide if the Captions match

Read through the text components for both the Source Content and Match Candidate
and determine how well they match. For overly long texts, read through the first five
paragraphs and skim the rest.
• You may stop reading long texts when it becomes clear the components are
unrelated/do not match.
• When coming across a job where the Match Candidate does not have an
image, but the Source Content does, compare the captions and ignore the
single image.
Based on your assessment, determine the correct description from the list of choices
below:

Near Duplicates: These are both identical or almost identical with the following
trivial differences:
• Trivial Differences in Variance: up to 10% variance in text.
o Example: 100 words of text with a candidate that has 10 words that
are different.
▪ The captions can be the same length but have up to a 10%
difference in overall text, or
▪ The captions can differ in length by up to 10%
• Trivial Differences in Formatting: These are differences that include:
o spacing or text formatting
o punctuation
o the addition or subtraction of citations or copyright claims
o linking strategies.
• Trivial Differences to text: These are differences that include:
o Different spellings
o Character substitutions (deliberate or not)
▪ Examples:
• ✅ The cat's fur really soft | The kats fur is very sofft
• ✅ The cat's fur really soft | Thè cätš für is vêrÿ søft
o The addition or subtraction of emojis for emphasis.
▪ Examples:
• Adding an emoji after a joke.
• Adding an emoji after a question.

NOTE: These differences do not change the meaning of the component.

Near Match: The Match Candidate and Source Content are a near match when:
• The differences are greater than the criteria for Near Duplicates, but the
components share a similar message.
They Do Not Match: The Match Candidate and Source Content do not match when:
• They don’t refer to the same subject matter
• They make different claims
Unsure: If you are unsure whether something should be labeled as Near Match or
Do Not Match, then choose Unsure.
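
As a worked illustration of the 10% rule, a rough word-level variance between two texts can be computed as below. This is a sketch only: the tokenization is deliberately naive, and the final judgment is always yours.

```python
# Rough illustration of the 10% text-variance rule; not an official tool.
import difflib

def text_variance(source: str, candidate: str) -> float:
    # Share of word-level content that differs between the two texts.
    a, b = source.lower().split(), candidate.lower().split()
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

# Guideline example: 100 words with 10 changed is ~10% variance,
# which is within the Near Duplicate threshold.
source = " ".join(["word"] * 90 + ["old"] * 10)
candidate = " ".join(["word"] * 90 + ["new"] * 10)
print(f"{text_variance(source, candidate):.0%}")  # -> 10%
```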

Common Cases for Captions


1. Small remarks of agreement: This is when someone agrees (or disagrees,
depending on the job) with the post by adding a couple of words or sentences
before or after it that add no meaning beyond the agreement.
a. How to handle: This is usually less than a 10% difference and does
not change the meaning of the component.
i. According to the guidelines, this would be considered a Near
Duplicate.
2. Copy and Paste: This is when the post author urges others to reshare by
copying or pasting the content.
a. How to handle: This is usually less than 10% of difference and does
not change the meaning of the component.
i. According to the guidelines, this would be considered a Near
Duplicate.
3. Attribution: This case is when someone replaces the person to whom a quote
is attributed or who is the subject of a story.
a. How to handle: Make sure to read the Claim Under Review first, since it
matters. If the claim is about person A saying something (person A’s
quote), but in the Match Candidate the quote is attributed to someone
else, it would be considered Do Not Match. However, if the claim is
about the content itself, the Match Candidate and the Source Content
can be a Near Duplicate depending on the difference.

URL to URL Comparison

Once you have determined the job meets the criteria to be rated, you will be asked
to determine how well the URL Links match on an individual component level.

Decide if the URL’s Match

Compare the Match Candidate and Source Content; how would you describe the
relationship between the two URL links?

1. Open each link in a new tab; then review the body of the destination page. The
SRT links are not currently clickable. You must manually copy and paste each URL
into a new browser tab. Please take great care to copy the entire URL string.
• Ignore advertisements, menu bars, comments, and related articles. Focus
solely on the title and body of the URL’s destination (blog post, article, etc.)
• If the focal point of the target page is a video (e.g., the URL is a YouTube link),
watch the first 2 minutes and skim through the remainder.
2. Read through the first 5 paragraphs, then skim through the rest.
• You may stop reading long texts when it becomes clear the components are
unrelated/do not match.
3. Decide how well the Match Candidate matches the Source Content, using the
labels outlined below.

Note: It is easy to accidentally paste the same link into both tabs: ensure you
are comparing the two distinct URLs provided.

Near Duplicates: These are both identical or almost identical with the following
trivial differences:
• Trivial Differences in Variance: up to 10% variance in text.
o Example: 100 words of text with a candidate that has 10 words that
are different.
▪ The texts can be the same length but have up to a 10%
difference in overall text, or
▪ The texts can differ in length by up to 10%
• Trivial Differences in Formatting: Differences that include:
o spacing or text formatting
o punctuation
o the addition or subtraction of citations or copyright claims
o linking strategies.
▪ Examples:
• The link https://nyti.ms/3cXHiDa is equivalent to
https://nytimes.com
• Spelling out a link after a hypertext reference is trivial;
you may ignore differences such as these:
o Visit this link

o Visit this link (http://nytimes.com)
• Trivial Differences to Text: Differences that include:
o Different spellings
o Character substitutions (deliberate or not)
▪ Examples:
• ✅ The cat's fur really soft | The kats fur is very sofft
• ✅ The cat's fur really soft | Thè cätš für is vêrÿ søft
o The addition or subtraction of emojis for emphasis.
▪ Examples:
• Adding an emoji after a joke.
• Adding an emoji after a question.
NOTE: These differences do not change the meaning of the component.

Near Match: The Match Candidate and Source Content are a near match when:
• The differences are greater than the criteria for Near Duplicates, but the
components share a similar message.

They Do Not Match: The Match Candidate and Source Content do not match when:
• They don’t refer to the same subject matter
• They make different claims
Unsure: If you are unsure whether something should be labeled as Near Match or
Do Not Match, then choose Unsure.
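
Because trivial link formatting (scheme, “www.”, trailing slashes, tracking parameters) should not sway your rating, it can help to mentally normalize both URLs before comparing them. The sketch below shows one such normalization using Python’s standard urllib; note that expanding shorteners like nyti.ms genuinely requires following the redirect in your browser, which this sketch does not attempt.

```python
# Illustration: normalize URLs so trivial formatting differences
# (scheme, "www.", trailing slash, tracking parameters) are ignored.
from urllib.parse import urlparse, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}

def normalize(url: str) -> str:
    parts = urlparse(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    # Drop common tracking parameters and sort the rest for stability.
    query = urlencode(sorted((k, v) for k, v in parse_qsl(parts.query)
                             if k not in TRACKING_PARAMS))
    return f"{host}{path}" + (f"?{query}" if query else "")

print(normalize("https://www.nytimes.com/section/world/?utm_source=x"))
print(normalize("http://nytimes.com/section/world"))  # same result
```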

Common Cases for URLs


1. Link to a video or page where there is another content type (for example, a
link that takes you to an article that has a video)
a. How to handle: If the URL links return video content rather than text
content, evaluate the job using the criteria outlined under the Video section.

Video to Video Comparison

Once you have determined the job meets the criteria to be rated, you will be asked
to determine how well the videos match on an individual component level.

Note: Do not consider the overall meaning of the Source or Match Candidate when
completing these subtasks. Think of these tasks as narrowly defined, standalone
content-matching questions. After the component-level review is complete, you will
look at the bigger picture, consider the examples holistically, and tell us if they share
the same meaning overall.

Decide if the videos match

Near Duplicates: These are identical or almost identical videos with the following
trivial differences:
• Trivial Differences in Formatting: These are differences that include:
o cropping, tint, color, brightness
o screenshots
o rotation
o stretching
o padding
o pixelation
• Trivial differences in Imagery: These are differences that include added or
subtracted imagery that is trivial in amount, such as watermarks, arrows or
circles.
• Trivial difference to overlaid text: These are differences that include added or
subtracted overlay text that is trivial in amount, such as:
o Different spellings or character substitutions (deliberate or not)
▪ Examples:
• ✅ The cat's fur really soft | The kats fur is very sofft
• ✅ The cat's fur really soft | Thè cätš für is vêrÿ søft
o The addition or subtraction of emojis for emphasis
o Accurate closed captioning (in English)
o Comments that don’t change the meaning of the component
▪ Examples: “Wow”, “Amazing!”
• Trivial Difference in Content or Length: Differences include:
o A difference in the length of the videos of either a couple of seconds
or up to 10% of the video’s length in seconds, whichever is longer,
between the candidate video and the source video.
▪ Example: a 3-minute video with a candidate that is ~18
seconds different
▪ Example: a 30-second video with a 36-second video
o 10% difference in content

▪ Example: The length of the video is the same, but 10% of it is
different content
• Trivial Differences in Audio:
o The audio is compressed (playback is of slightly lower or higher
quality)
o Has a different volume level
o Music has been changed, added, or removed in a way that does not
affect meaning
o The audio track has been silenced without affecting meaning
o The audio tracks feature different speakers, but the original meaning
is not affected
o The audio contains sound-effects that do not affect the meaning of the
video
NOTE: These differences do not change the meaning of the component.

Near Match: The Match Candidate and Source Content are near matches when:
o The differences are greater than the criteria for Near Duplicates, but
the components share a similar message.

They Do Not Match: The Match Candidate and Source Content do not match when:
• They don’t refer to the same subject matter
• They make different claims
• Differences in text or formatting that change the meaning of the
component
Unsure: If you are unsure whether something should be labeled as Near Match or
Do Not Match, then choose Unsure.
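
The length rule above (“a couple of seconds or up to 10%, whichever is longer”) reduces to simple arithmetic, sketched below. The two-second floor is our reading of “a couple seconds” and is illustrative, not official policy.

```python
# Illustration of the trivial length-difference rule for videos:
# allow a couple of seconds OR 10% of the length, whichever is longer.
def length_difference_is_trivial(source_s: float, candidate_s: float,
                                 floor_s: float = 2.0) -> bool:
    # floor_s is our assumed reading of "a couple seconds".
    allowed = max(floor_s, 0.10 * max(source_s, candidate_s))
    return abs(source_s - candidate_s) <= allowed

# Guideline example: a 3-minute (180 s) video vs. a candidate ~18 s longer.
print(length_difference_is_trivial(180, 198))  # True: 18 s is 10% of 180 s
```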

Common Cases for Videos


1. Differences in Length: Right off the bat, you can compare the lengths of both
videos. A common case in the queues is two videos that are similar but
differ in length.
a. How to handle: Referring to the guidelines, if the difference in content
or length is up to 10%, the videos are considered Near Duplicates.
2. Edited Videos: Videos can appear to be identical or very similar, but they
can also be edited to mislead the viewers.
a. How to handle: If the videos are edited in a manner that changes the
meaning of the video with respect to the claim under review, it cannot
be a Near Duplicate.
i. Choose Do Not Match

Cross Component Comparison

Often, when going through jobs in the queue, you will come across jobs where the
Match Candidate and Source Content match, but through different components.
These jobs are considered Cross Component Jobs. Don’t fret, this is expected! The
following examples cover these types of jobs in more detail:
1. The Match Candidate and Source Content have the same components, but
different components are considered a match according to criteria outlined
in these guidelines.
a. Example: Both the Match Candidate and the Source Content have a
caption and an image. However, the Match Candidate’s caption
matches to the Source Content’s overlaid text within the image.
2. The Match Candidate and Source Content have different components
altogether, but those different components are considered a match according
to criteria outlined in these guidelines.
a. Example: A Match Candidate consists of only an image. A Source
Content consists of only a video. Although they are different
components and don’t share any components, the image in the Match
Candidate matches one of the images within the Source Content's
video.

Example Job [Caption → Overlaid Text]


i. Here’s an example job where the Match Candidate’s Caption
matches the Source Content’s Overlaid Text. When first going
through the job, quickly glance at all components of both the
Match Candidate and the Source Content. Remember that
you’re evaluating the job with respect to the Match
Candidate, and although different components can match, you
treat the matching components the same way as in Same
Component jobs. In this case, if there is less than a 10%
difference in text between the Match Candidate’s Caption (text)
and the Source Content’s Overlaid Text (text), it would be
considered a Near Duplicate.

Detailed Walkthrough of Example Job

Question Is this Content Qualified for Review?


Answer: According to the Disqualification Criteria (question 1), this job is qualified
for review. Continue to the next question.

What Component(s) of the Match Candidate contains the match?


Answer: As you may recall from the rating process at the beginning of the guidelines:
“First look at all of the components in the Match Candidate, then evaluate the Source
Content. Are there any components from the Match Candidate that match to the
Source Content?” In this case, the Overlaid Text of the Source Content matches the
Match Candidate’s Caption. Choose Caption. In the next question, you will be asked
what type of match it is. According to the guidelines, these two text components are
considered Near Duplicates.

CAPTION MATCH TYPE: What component from the Source Content does the
match candidate's caption match to?
Answer: As discussed above, the Caption from the Match Candidate matches to the
Source Content’s Overlaid text. Choose Overlaid Text.

OVERALL MEANING COMPARISON: Consider the intended meaning of the
Match Candidate as a whole: how does it align with the Source Content, with
respect to the claim under review?
Answer: Looking at both contents with respect to the claim under review, they agree
with one another. There is additional text in the Source Content’s Caption, but
it doesn’t add to or change the meaning of the post overall; thus, choose Agrees.

General Tips for Approaching Cross Component Jobs


• Be keen to observe any combination of components that can potentially
match. For instance, since overlaid text and captions are both text, they can
match according to the criteria in these guidelines. Something not as
straightforward would be a video (with audio) compared to a caption. In this
case, remember that the audio can be found in the video transcript, and the
audio can potentially match the other content’s caption.
• Remember to use the same criteria for regular jobs to determine if different
components match to one another.
o With each component (photo, text, video) there are certain matching
criteria like differences in text, length, overall content, sound, etc. Use
these criteria just as you would with jobs that are same-component
matches.

Common Cases for Cross Component Comparisons

Overlaid Text Matches to Caption: These are jobs where the overlaid text of an
image matches to a caption.
How to Handle: Use the matching criteria for captions described above. These
criteria include trivial differences in variance, formatting, and text.

Video Matches to a Still Image: A video slideshow of images matches to a still
image.
How to Handle: Treat the image in the video slideshow that’s identical or very
similar (if there is one) as a regular image. This makes it much easier to determine
whether something is a match.

Overall Holistic Comparison

Looking at all of the components together, you will then decide how well the Match
Candidate aligns with the Source Content with respect to the “central claim” being
expressed.

When you are evaluating this part of the question, it is important to understand the
“central claim” and the “claim under review”.

Rating Process
1. Review the Claim Under Review
a. If the Claim Under Review is ambiguous, vague, or not well written,
click the link to skim the headline and the first couple of sentences of
the Fact Check Article.
2. Identify the central claim for the Source content
a. Recall: If the Source Content does not relate to the Claim Under
Review, disqualify the job.
3. Make sure to take all components into account and look at the content
holistically (for example, take into account how the caption, the image, and
the overlaid text combined tell the intended meaning of the content).
4. Determine if the claim made in the Match Candidate matches the main claim
being made in the Source Content.
How to Use Link to Fact Check

The link to the Fact Check Article is the source of truth for the associated Claim
Under Review or the Central Claim within the Source Content. The link to the Fact
Check Article should first be used to determine if it is related to the Source Content
and the Claim Under Review. Afterwards, it should be used if you need further
clarification about the Claim Under Review while determining if the Source Content
and Match Candidate are holistic matches.

Labels for Holistic Comparison


• Agrees: The Match Candidate agrees with the Source Content’s central claim.
• Does Not Agree: The Match Candidate and Source Content refer to same
subject matter, but they disagree with respect to the central claim. In these
cases, Source Content and Match Candidate are mutually exclusive: i.e. they
cannot both be true.
• Unrelated: The main subject or main claims of the Match Candidate and
Source Content are not the same.
• Unsure: The Match Candidate does not take an obvious position on the Claim
Under Review or too much information or context has been taken away or
added.

Common Cases for Holistic Matching
1. Debunking: This is when a Match Candidate debunks the claim made in the
Source Content or vice versa.
a. How to handle: If the Match Candidate debunks the Source Content’s
central claim, it therefore has a different meaning and should be
labeled as Does Not Agree.
2. Ambiguous Meaning:
a. How to handle: If it is unclear whether the Match Candidate agrees
with the Source Content or vice versa, choose Unsure.

Tips for Holistic Matching


• Two posts may appear to be identical, but with the addition of one or a
couple of words, they have completely different meanings.
o Example:
▪ Caption A: “you can cure covid in the following ways: taking
magnesium, drinking apple juice...”
▪ Caption B: “The following is FALSE.... you can cure covid in the
following ways: taking magnesium, drinking apple juice...”
• Be sure that you understand the Claim Under Review.
o NOTE: if the Claim Under Review that is written is vague, open the
link under the Claim Under Review and find out what part of the
article the Source Content is focused on.

Examples
Same Component Comparison Jobs

1) Trivial Differences in Formatting: Text

Spacing and punctuation: This is an example where there’s a 10% variance in text;
however, the added sentence does not change the meaning of the text/component.

The different spacing and punctuation would be considered trivial differences (and
do not change the meaning of the component).

2) Trivial Differences in Formatting: Text

According to the guidelines, if the difference in length is up to 10% or a couple of
words (whichever is longer), and the meaning remains the same, it is considered a
Near Duplicate.

In this example, the additional sentence at the end of the Match Candidate does not
make up more than 10% of the total text, and it does not change the meaning of the
text. Therefore, it would be considered a Near Duplicate.

3) Multiple Differences in Text

Spacing and punctuation: The spacing and punctuation are different but trivial,
since they don’t affect the meaning of the text.

Difference in Length: The additional sentences at the beginning of the Match
Candidate do not change the meaning.

Attribution: The quotation marks are not correctly used in the Match Candidate,
but it’s still clear the statement is attributed to Trey Gowdy. Therefore, both of the
captions are attributed to the same person.

Copy and Paste: The post author adds “Copy and paste if you dare” at the end.
This does not affect the meaning.

Label: Near Duplicate

Differences for Images

4) Identical Images

Exact same: these images are exactly the same or near-exactly the same upon
observation.

Component Label: Near Duplicate

5) Identical Images

Exact same: these images are exactly the same or near-exactly the same upon
observation.

Component Label: Near Duplicate

6) Trivial Differences in Formatting: Image

Trivial formatting: this is trivial because the full text is included in both images and
any cropping does not change the image’s meaning. There are no substantive
changes as cropping the bottom only removes whitespace and does not change the
user’s understanding of the content.

Component Label: Near Duplicate

7) Trivial Differences in Formatting: Image

Trivial formatting: Buffer or whitespace added that does not change the meaning
or substance of the image.

Component Label: Near Duplicate

8) Trivial Differences in Formatting: Image

Trivial overlay/text: the overlay does not make a substantive change to the
meaning of the image. The added overlays are related to the original text and do not
change user understanding. If the overlays were a politician’s face, a company
slogan, etc., then the meaning would be changed.

Component Label: Near Duplicate

9) Trivial Differences in Formatting: Image

Trivial formatting: Screenshots of content that do not show additional comments
that change the meaning of the content are considered Near Duplicates (if they are
the same picture with trivial differences like cropping or other differences described
in the Near Duplicate criteria).

Component Label: Near Duplicate

10) Trivial Differences in Formatting: Images

Trivial formatting: Screenshots of content that do not show additional comments
that change the meaning of the content are considered Near Duplicates (if they are
the same picture with trivial differences like cropping or other differences described
in the Near Duplicate criteria).

Component Label: Near Duplicate

11) Trivial Differences in Formatting: Images

Trivial formatting: cropping and trivial overlay that don’t change the meaning or
message of the image. The reshare text (generic account) does not indicate
endorsement or comment by a person of interest that would change the image’s message.

Component Label: Near Duplicate

12) Substantive Differences in Formatting: Image

Substantive Overlay/Text: If we ignore the fact that the text is non-English, the
overlays and text on the image add a value judgment and change the meaning of the
photo. This particular example is a quasi-political statement added to the original
image.

Component Label: Do Not Match

13) Substantive Differences in Formatting: Image

Substantive formatting: the cropping removes key elements of the image that
change its meaning. Cropping to remove an individual that is key to the context
(here holding the chain) influences user perception and does not qualify as a match.

Component Label: Do Not Match

14) Substantive Differences in Formatting: Image

Substantive formatting: The cropping changes the image’s meaning by removing
context and words integral to the message (“Corona virus”). The user would not
have the same understanding when key parts of the message are removed.

Component Label: Do Not Match

15) Common Cases for Captions: Small Remarks of Agreement

Small remarks of agreement: This is when someone agrees with the post by adding
a few words or sentences before or after it (usually less than a 10% difference in
length) that should NOT change the meaning.
• Example: In this case, the first couple of sentences give an additional sense of
agreement to the caption that follows but do not change or add to the
meaning.
Label: Near Duplicate

16) Common Cases for Captions: Copy and Paste

Copy and Paste:

This is when the post author urges others to reshare by copying or pasting the
content.

According to the guidelines, this would be considered a Near Duplicate.

17) Common Cases for Images: Link to Image

Link to Image: This is when a link appears instead of a photo, but by looking at the
questions given in SRT, it’s a photo comparison.

Since the photos are the same with trivial differences (cropping) when you paste
the link into the browser, they are considered Near Duplicates.

18) Common Cases for Images: Photo to Album (Multiple photos) Comparison

Cross Component Jobs

19) Caption to Overlaid text

Though the Source Content here is a photo, it’s a Near Duplicate match according to
the caption-to-caption matching criteria. Remember, overlaid text is treated with
the same criteria as a caption in cross component jobs.

