Web Clean Body Identification
Quality Guidelines
New Version 7.1!! Pay attention to:
- New judging option: Gray area. Select it when the highlight is neither typical primary nor
typical non-primary
- Definition changes
o Reviews and site notifications are non-primary again. See Typical non-primary content
- Examples changes
o Reviews are no longer gray. Gray area in tourism review page
o Reviews are no longer gray. Gray area in food/service review page
o Reviews are no longer gray. Gray area in product page
o Categories at the bottom are not gray. Gray area in wiki page
o On a Home page, most content except the header/footer is primary
Your Task
In this HitApp, you will be given a webpage and a text block highlighted in green. Select the check box
that indicates whether the content in the text block belongs to the primary content of the webpage.
We have already highlighted the text for you on the page. You just need to scroll down, find the green
highlight, and make the decision:
If the highlight is typical primary content, select Yes, this is a typical primary content.
If the highlight is typical non-primary content, select No, this is a typical non-primary.
If the highlight falls somewhere between typical primary and non-primary, the guidelines do not
clearly cover it, and you are not sure which side it belongs to, select This is Gray area.
If the page doesn’t contain any green highlight (as in the example below), or the highlight only
appears after you click something, e.g. open a drop-down list or expand a collapsed section, choose
“it’s invisible”.
If the page can’t load, or a large pop-up blocks the whole page behind it, choose Skip.
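The five options above can be read as a small decision procedure. The following is a minimal sketch of that reading, not part of the HitApp itself: the `Observation` fields are hypothetical names, and the order of the checks (page problems first, then visibility, then the content judgment) is an assumption that the guideline does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What a judge sees on the page (all field names are hypothetical)."""
    page_loaded: bool = True
    popup_blocks_page: bool = False
    highlight_visible: bool = True  # False if missing or only shown after a click
    # "primary", "non-primary", or None when the judge is not sure
    typical_kind: Optional[str] = None


def choose_option(obs: Observation) -> str:
    """Return the option label to select for one highlighted text block."""
    # Assumed order: page-level problems are checked before the highlight itself.
    if not obs.page_loaded or obs.popup_blocks_page:
        return "Skip"
    if not obs.highlight_visible:
        return "it's invisible"
    if obs.typical_kind == "primary":
        return "Yes, this is a typical primary content"
    if obs.typical_kind == "non-primary":
        return "No, this is a typical non-primary"
    # Neither clearly primary nor clearly non-primary.
    return "This is Gray area"
```

For example, `choose_option(Observation(typical_kind="primary"))` returns the "Yes" label, while an observation with no clear judgment falls through to the Gray area label.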
Prerequisite: Instructions for Invoking Chrome
For this HitApp, we suggest always using the Chrome browser. Please follow the instructions below to
set up and launch Chrome correctly, which lets you automatically skip the spam/security screen alert.
1. If you don’t have Chrome browser, please download Chrome first and add a shortcut on your
desktop
2. Depending on your local desktop configuration, you might need Admin permissions to make the
below changes
3. Right click on the shortcut on the desktop and select “Properties”
4. In the Properties window, ensure that you are on the “Shortcut” tab (see image below)
5. In the “Target:” property, you will see the path to “chrome.exe” listed. Click in this path,
move the cursor to the end, add a space, and then add the switch below after it:
" --allow-running-insecure-content"
a. Quotes are intentional and should be included
b. Notice that there is a whitespace after the starting quote
6. Click on “Apply”, then “OK” and close the Properties window
7. Always launch Chrome by double clicking on this updated shortcut when using this HitApp
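As an alternative illustration of what the updated shortcut does, the sketch below starts Chrome with the same switch from a script. It is only a sketch under stated assumptions: the install path is an assumed default location and may differ on your machine, and the function names are hypothetical. Because the switch is passed as its own list element, no quoting is applied here, unlike in the shortcut's “Target:” field.

```python
import subprocess
from typing import List

# Switch taken from step 5 of the instructions above.
INSECURE_CONTENT_SWITCH = "--allow-running-insecure-content"

# Assumed default install location; use the path shown in your own shortcut.
ASSUMED_CHROME_PATH = r"C:\Program Files\Google\Chrome\Application\chrome.exe"


def build_chrome_command(chrome_path: str = ASSUMED_CHROME_PATH) -> List[str]:
    """Return the argument list: the executable followed by the switch."""
    return [chrome_path, INSECURE_CONTENT_SWITCH]


def launch_chrome(chrome_path: str = ASSUMED_CHROME_PATH) -> subprocess.Popen:
    """Start Chrome with the switch; raises FileNotFoundError if the path is wrong."""
    return subprocess.Popen(build_chrome_command(chrome_path))
```

Calling `launch_chrome()` opens a Chrome window in the same way as double clicking the updated shortcut, provided the assumed path matches your installation.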
Primary Content Definition
Primary Content is the content of a webpage that is the reason a user wants to visit the webpage.
If you are a user who visits a webpage on SeattleTimes.com, you want to read the text of the article that
a journalist wrote, not the ads that are interspersed throughout the article, or the related articles, or the
navigation bar on top. All that content might distract you and cause you to click on it, but the main
reason you visit an article page is to read the actual article. Primary Content is that content, but for any
page on the web. Similarly, for forums, product pages, question & answer sites, and search pages,
the Primary Content is the main reason you are visiting the page.
Secondary Content is content that can be removed from the page while the web page
remains valuable. In other words, secondary content is not the reason you visit the page.
It doesn’t contribute to the main information on the page, e.g. ads and the page header are the same
across the website. Even without ads, header, footer, and related articles, the core value of the page
is not affected.
Non-primary content is of little use compared with the core information of the page. In other words,
removing it doesn’t affect the core content on the page:
- Buttons are usually not Primary.
o E.g. shopping cart button, login button, search button, buy button
- Headers and footers
o Usually the very top and very bottom of a web page
- Website notifications
o E.g. Wiki donation banner, notifications of changes to the website's terms of use policy
- User review/comment sections
- Navigation links/sections:
o This includes the top or side menus on homepages or other webpages
that repeat across many documents on the website
o This excludes “External links” on Wikipedia, which should be considered Primary
Content because it is unique content for this document.
- Related, trending, recommended or sponsored articles/videos/products etc. sections:
o This includes sections that are secondary to the main content of the webpage and
contain a list of similar or recommended articles, videos, products or other types of content.
o This excludes the “See also” section on Wikipedia, which should be considered Primary
Content because it is unique content for this document.
- Sidebars that are not related to the article content. Usually advertisements.
o This excludes Zillow, Redfin, etc., where the majority of the content is a map.
o This excludes Opening/Working Time in a sidebar, which is related to the primary
content
- Site-specific copyright information
o Usually at the very bottom, or at the end of articles
- Search bars and forms: comment forms, site search forms, login forms, etc.
o Search bars are usually at the beginning of the page. Comment forms are usually together with
comment/review sections.
- Ads
- Social media links/panels
When the decision is hard to make, or the guidelines do not cover the case, just select Gray area.
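The non-primary list above, with its stated Wikipedia exceptions, can be sketched as a simple lookup. This is only an illustration: the category names are hypothetical labels invented for the sketch, and the sidebar exclusions (map-based sites, opening times) are left out because the list does not assign them an explicit label.

```python
# Hypothetical category names for the items on the non-primary list above.
TYPICAL_NON_PRIMARY = {
    "button", "header", "footer", "site_notification",
    "user_review", "user_comment", "navigation",
    "related_content", "unrelated_sidebar", "copyright",
    "search_bar", "form", "ad", "social_media",
}

# Exceptions that the list explicitly says should be considered Primary Content.
STATED_PRIMARY_EXCEPTIONS = {
    "wikipedia_external_links",
    "wikipedia_see_also",
}


def is_typical_non_primary(category: str) -> bool:
    """True if the category is on the non-primary list and not a stated exception."""
    if category in STATED_PRIMARY_EXCEPTIONS:
        return False
    return category in TYPICAL_NON_PRIMARY
```

A category that is not in either set is simply not covered by this lookup; it still has to be judged against the Primary Content definition, with Gray area as the fallback when the case is unclear.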
Typical Article Page Example
WebPage: Seattle Times
Breadcrumbs, Article, Title, Author, Date, Author intro, and image description are all primary.
Header, Footer, Comments, Ads, and Related/Recommended content are not primary.
Tourism Review Page Example
Example WebPage
The product/service/place intro, price, spec, location, and contact are primary.
Header, Footer, Ads, and sidebars used for filters are not primary.
Food/Service Review Page Example
Example: https://www.yelp.com/biz/pinecrest-bakery-pinecrest-pinecrest
Restaurant/Service intro and menus are primary.
Header, Footer, Ads, and other recommended Restaurants/Services are not primary.
Forum Page Example
Example WebPage:
https://www.reddit.com/r/deeplearning/comments/jgnokl/is_the_practical_deep_learning_for_coders_course/
>>> Special case for wiki pages: See also and References are Primary as well. <<<