
BING – Junk Labeling

THIS DOCUMENT CONTAINS CONFIDENTIAL AND PROPRIETARY INFORMATION BELONGING TO MICROSOFT
CORPORATION.
THE RECIPIENT UNDERSTANDS AND AGREES THAT THESE MATERIALS AND THE INFORMATION CONTAINED HEREIN
MAY NOT BE USED OR DISCLOSED WITHOUT THE PRIOR WRITTEN CONSENT OF MICROSOFT CORPORATION.

Contents
INTRODUCTION TO JUNK LABELING
DATA LIST/FILE DIRECTORY
SOFT 404
EMPTY SERP
OTHER ERROR MESSAGE
WEB SEARCH RESULT PAGE
LOW OR NO CONTENT
UNDER CONSTRUCTION/ABANDONED
PARKED DOMAIN/BOILERPLATE
ADS ONLY
MARKUP FILE
REDIRECTION
SOFT REDIRECTION
ADDITIONAL CLARIFICATIONS
NOT SECURE/ NOT PRIVATE PAGES
NOT JUNK LOGIN PAGES
OUT OF STOCK/NO LONGER AVAILABLE PAGES

INTRODUCTION TO JUNK LABELING

In this task, you will be judging whether a page is Junk or Not Junk.
A page is Not Junk if it is clear that there is some value to a user on a particular topic or in a particular
context. Most web pages you see on a search engine such as Bing or Google would be considered Not
Junk.
A Junk page is a page that imparts little or no value to any user on any topic. There is absolutely no
topic or query for which that page would be a good result.
This task is query-independent. You are *not* judging that document’s relevance to a query. A
document may be a horrible result for a particular query, but still not be junk because it is a good result
for some other query.

When determining whether a page is junk, the document must be judged solely on the merit of its
content, *not* on which site it belongs to. A document under a well-respected site like cnn.com could just
as easily end up being Junk as a document from some little-known abcde.com site. For example,
sites both big and small frequently employ mechanisms to redirect a particular URL (expired content,
missing data, etc.) to an error page within their domain; these pages provide no useful content to the
user and should not be shown at all by the search engine.

If a page falls under one or more of the following categories, it is junk.
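The decision rule above can be sketched in code. This is only an illustration and not part of any labeling tool: the `JunkCategory` enum and the `is_junk` helper are hypothetical names, and the members simply mirror the category sections of this document.

```python
from enum import Enum


class JunkCategory(Enum):
    """Junk categories named in this guideline (the enum itself is hypothetical)."""
    DATA_LIST_FILE_DIRECTORY = "Data List/File Directory"
    SOFT_404 = "Soft 404"
    EMPTY_SERP = "Empty SERP"
    OTHER_ERROR_MESSAGE = "Other Error Message"
    WEB_SEARCH_RESULT_PAGE = "Web Search Result Page"
    LOW_OR_NO_CONTENT = "Low or No Content"
    UNDER_CONSTRUCTION_ABANDONED = "Under Construction/Abandoned"
    PARKED_DOMAIN_BOILERPLATE = "Parked Domain/Boilerplate"
    ADS_ONLY = "Ads Only"
    MARKUP_FILE = "Markup File"
    REDIRECTION = "Redirection"
    SOFT_REDIRECTION = "Soft Redirection"


def is_junk(matched_categories):
    """A page is junk if the judge matched it to at least one category."""
    return len(set(matched_categories)) > 0
```

Note that neither a query nor a site name appears as an input: the judgment depends only on which categories the page content matches, which reflects the query-independent and site-independent nature of the task.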


DATA LIST/FILE DIRECTORY

The majority of the page holds just a list of data (e.g. phone numbers, word list, digit list, email list, etc.). A user does
not benefit because they cannot make sense of such a list. Some data lists are just dumps from applications;
others are generated for SEO purposes (e.g. a phone number list page). Also included are the contents of a server
directory indexed as a web page (these often have "Index Of /" at the top of the page, with folder and file images arrayed in a list).

Data List – JUNK: This page contains a huge number of phone numbers for SEO purposes.
Data List – JUNK: This text file contains some experiment data which regular users cannot make sense of
without context.

File Directory – JUNK: This page is an index of files for the website and is considered a File Directory.
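As a rough illustration of the "Index Of /" cue mentioned above, a pre-check for server directory listings might look like the following. This is a sketch under an assumption: it supposes that the listing carries the phrase in its `<title>` or `<h1>` element, which is common for auto-generated listings but is not stated in this guideline. The function name is hypothetical, and the human judgment remains the deciding factor.

```python
import re

# Assumption: auto-generated listings put "Index of /..." in the title or top heading.
_INDEX_OF = re.compile(r"<(title|h1)[^>]*>\s*Index of\s*/", re.IGNORECASE)


def looks_like_file_directory(html):
    """Return True if the HTML carries an 'Index of /' title or top-level heading."""
    return bool(_INDEX_OF.search(html))
```

For example, `looks_like_file_directory("<title>Index of /pub</title>")` is true, while a regular page that merely mentions an index in its body text is not flagged.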
SOFT 404

The page contains an error message reporting that the whole page, or the major part of it, is missing or ‘down’. For
example, if the page is about selling a product, it is considered a Soft 404 when an error message such as
“this product is removed from our shop” is shown. Another example is a YouTube page where the video
displays “This video has been removed” or a similar error message.

Soft 404 – JUNK: Page displays a message stating, “Sorry, the page you’re looking for cannot be found”
and content of the page is missing.
Soft 404 – JUNK: Page displays the message, “There are no Press Releases for RBS^F” and no other
content is available in the main content.
Soft 404 – JUNK: Page displays the message, “Sorry, We cannot find the page you are looking for.” and
the main content of the page is missing. Even though the navigational links are there at the bottom of
the page, the main content that a person would be looking for is not there, so this page does not satisfy
the needs of any users or situations.
EMPTY SERP

A search result page with zero results. Such pages typically contain error messages like “no results are found”, etc.

Empty SERP – JUNK: Page displays 0 items with a message of “0 results found”.
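The kind of message described above can be spotted with a simple text check. The sketch below is illustrative only: the patterns are assumptions based on the two example messages quoted in this section ("no results are found" and "0 results found"), the function name is hypothetical, and real pages word these messages in many other ways.

```python
import re

# Assumed patterns, derived from the example messages quoted in this section.
_EMPTY_SERP_PATTERNS = [
    re.compile(r"\b0\s+(results?|items?)\b", re.IGNORECASE),
    re.compile(r"\bno\s+results?\s+(were\s+|are\s+)?found\b", re.IGNORECASE),
]


def mentions_zero_results(text):
    """Return True if the visible page text contains a zero-results message."""
    return any(p.search(text) for p in _EMPTY_SERP_PATTERNS)
```

The word boundary before `0` keeps counts such as "10 results" from matching. A hit only suggests an Empty SERP; the page still has to be a search result page with nothing else of value on it.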
OTHER ERROR MESSAGE

The majority of the page content is missing due to a stated error (e.g. database error, Flash or JavaScript related
errors, etc.). Besides the error message, the main body of the page does not contain any valuable information for
users.

Other Error – JUNK: The main body of the page is just an error message.
WEB SEARCH RESULT PAGE

A result page from a web/image/video/news search engine or aggregator.

Web SERP does not include specialized search engine results such as job, local, real
estate, Yellow Pages, White Pages, or comparison shopping search results. Search results on sites such as
Youtube.com and Amazon list products and content internal to the website, and are NOT Web SERP junk.

Web SERP – JUNK: This page is a search result aggregator with no specialized content.
Web SERP – NOT JUNK: Job/local/people/etc SERP is not a junk page (any specialized search page).
Web SERP – NOT JUNK: Internal SERP (amazon, ebay, etc.) is not Junk.
LOW OR NO CONTENT

The page contains little or no useful content (e.g. a page with an empty body surrounded by navigation bars, a profile
that isn’t filled out, or a page that is only ads).
A page with just images on it (even just one) is *not* a low content page. A page that requests a download (like a
PDF, etc.) is NOT Junk. It may seem like a No Content page, but check to see if your browser is asking if you want to
download a file.

Low/No Content – JUNK: Main body is completely empty


UNDER CONSTRUCTION/ABANDONED

The page is either under construction or abandoned, so it has little or no functionality. Also included are
default pages of a web hosting service without any useful information on the page.

Under Construction – JUNK: Page displays an ‘under construction’ message with no added content.
PARKED DOMAIN/BOILERPLATE

Parked Domain pages are pages that are hosted by a domain parking service. These pages have no purpose other
than as a placeholder for (possible) future content.
NOTE: If the functionality of the site is not severely impacted (i.e., you can still navigate around it or find valuable
information), do not mark it as junk.

Parked Domain – JUNK: This is a Parked Domain page hosted by a domain parking service purely as a
placeholder.
Boilerplate Template – JUNK: This is an empty blog page. No info is provided besides the default
content.
ADS ONLY

The main content of the page only consists of ads with no additional content.

Ads Only – JUNK: Page content has ads only and no additional content.

MARKUP FILE

XML, JavaScript code, or CSS files. On their own, they provide no user benefit, since users cannot make sense of
them. These files are either resources of another page or some application’s data files.

Markup File – JUNK: This is a robots.txt file, which should not be indexed by the search engine.
REDIRECTION

The page URL is not the same as the original URL. This also includes redirects to the homepage.

Redirection – JUNK: Page is getting redirected to another URL.


https://www.shmoop.com/the-outsiders/chapter-7-summary.html

Redirection – JUNK: Page is getting redirected to home page.


https://www.automotiveuploads.com/instant-justice-police-2018-idiot-drivers-vs-police-63_5fac960fb.html
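
The comparison behind this category (original URL versus the URL the browser ends up on) can be sketched as follows. This is a minimal illustration with hypothetical function names. The normalization is an assumption, not something this guideline specifies: host case, a trailing slash, and the `#fragment` are ignored, while a change of scheme, host, path, or query counts as a redirect.

```python
from urllib.parse import urlsplit


def _normalize(url):
    """Reduce a URL to the parts compared here (assumed normalization)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), host, path, parts.query)


def is_redirected(original_url, final_url):
    """True if the final URL differs from the original after normalization."""
    return _normalize(original_url) != _normalize(final_url)


def redirected_to_homepage(original_url, final_url):
    """True if the page was redirected and landed on a site root."""
    _, _, path, query = _normalize(final_url)
    return is_redirected(original_url, final_url) and path == "/" and query == ""
```

With placeholder URLs, `redirected_to_homepage("https://example.com/old/article.html", "https://example.com/")` is true, while `is_redirected("https://Example.com/a/", "https://example.com/a#top")` is false because the two differ only in host case, trailing slash, and fragment.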
SOFT REDIRECTION

The page contains a link referring to the target page. Besides the link, the page does not contain any useful content.
Soft Redirection junk excludes site entrance pages, which usually contain nice images, Flash, or a brief text
introduction with an “enter the site” link, because the image/Flash/text provides valuable information about the site
and hence makes the page useful.

Soft Redirection – JUNK: The sole purpose of this page is to redirect you outside of the website.
Not Soft Redirection – NOT JUNK: The page is a clickable image for people to enter the site. It is not soft
redirection since the image conveys information about the site.
Not Soft Redirection – NOT JUNK: This page is a disclaimer for the site and provides useful information about terms
of use of the site that users need to know before entering. Therefore, it is not a junk page.
ADDITIONAL CLARIFICATIONS

NOT SECURE/ NOT PRIVATE PAGES


Pages which say ‘connection is not private’ or ‘site is not secure’ are not to be considered JUNK by default. Click
on ‘details’ or ‘advanced’ and proceed to the page to judge.

NOT JUNK LOGIN PAGES

Login Pages are considered NOT JUNK. They provide some functionality to users (especially for users looking for
that specific login page) and in many cases, contain a lot of good information behind the login screen.

Login Page 1 – NOT JUNK: The below page may not have a lot of content, but it is a useful login page for someone
trying to log into the service
Login Page 2 – NOT JUNK: The below page is similar to the above example: there is not a lot of content, but it
provides value to someone who wants to log into the service.
OUT OF STOCK/NO LONGER AVAILABLE PAGES

Pages that contain items or services that are no longer available to the user visiting the site. This includes products,
businesses, services, or any additional consumer feature of the page that is no longer available. They provide little
to no value to users, and require them to go visit other pages to accomplish their tasks. These pages can often look
like good pages, but with messages like “This product is out of stock” or “This business has closed”.

Out of Stock Page – JUNK: This page is showing a product that is no longer in stock on this
particular website. This page doesn’t provide any user value and is considered junk.
Business Closed Page – JUNK: This page is showing a yellow pages entry of a business that is now
closed, which is shown by the “This business has CLOSED” banner near the top of the screen. This page
is considered junk.

Product No Longer Available Page – JUNK: This page shows a product that is no longer available for
purchase. This page provides very little user value and is considered junk.
