Sitemap.xml link selector
The Sitemap.xml link selector can be used similarly to the Link selector to reach target pages (for example product pages). With this selector, the whole site can be traversed without setting up selectors for pagination or other site navigation. The Sitemap.xml link selector extracts URLs from sitemap.xml files, which websites publish so that search engine crawlers can navigate their sites more easily. In most cases they contain all of the site's relevant page URLs.
Web Scraper supports the standard sitemap.xml format. The sitemap.xml file can also be compressed (sitemap.xml.gz). If a sitemap.xml contains URLs to other sitemap.xml files, the selector will work recursively to find all URLs in the sub-sitemap.xml files.
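For illustration, a minimal sitemap index in the standard sitemaps.org format might look like this (the URLs are placeholders); each entry points to a sub-sitemap that the selector will fetch recursively:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A sitemap index: each <sitemap> entry points to a sub-sitemap file. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <!-- Sub-sitemaps may also be gzip-compressed. -->
    <loc>https://example.com/sitemap-categories.xml.gz</loc>
  </sitemap>
</sitemapindex>
```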
Note! Web Scraper has a download size limit. If multiple sitemap.xml URLs are used, the scraping job might fail due to exceeding the limit. To work around this, try splitting the sitemap into multiple sitemaps, where each sitemap has only one sitemap.xml URL.
Note! Sites that have sitemap.xml files are sometimes quite large. We recommend using Web Scraper Cloud for large-volume scraping.
Configuration options
sitemap.xml urls - a list of URLs of the site's sitemap.xml files. Multiple URLs can be added. By clicking "Add from robots.txt", Web Scraper will automatically add all sitemap.xml URLs that can be found in the site's robots.txt file. If no URLs are found, it is worth checking the /sitemap.xml URL, which might contain a sitemap.xml file that isn't listed in the robots.txt file.
found URL RegEx (optional) - a regular expression to match a substring of the URLs. If set, only URLs from the sitemap.xml that match the RegEx will be scraped.
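As a rough illustration of how such a RegEx narrows down the URL list (the URLs and the /product/ pattern below are made-up examples):

```python
import re

# Hypothetical URLs as they might appear in a site's sitemap.xml file.
urls = [
    "https://example.com/product/red-shoes",
    "https://example.com/category/footwear",
    "https://example.com/product/blue-hat",
    "https://example.com/about",
]

# Keep only URLs that match the RegEx; the rest are never traversed.
pattern = re.compile(r"/product/")
product_urls = [u for u in urls if pattern.search(u)]
print(product_urls)
```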
minimum priority (optional) - the minimum priority of URLs to be scraped. Inspect the sitemap.xml file to decide whether this value should be filled.
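To see what this option does with the priority values a sitemap carries, here is a small stand-alone sketch (the sample sitemap is invented) that parses a sitemap and keeps only URLs at or above a minimum priority:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product/1</loc><priority>0.8</priority></url>
  <url><loc>https://example.com/about</loc><priority>0.3</priority></url>
</urlset>"""

def urls_with_min_priority(xml_text, minimum):
    """Return sitemap URLs whose <priority> is at or above `minimum`."""
    root = ET.fromstring(xml_text)
    result = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        # Per the sitemaps.org protocol, a missing <priority> defaults to 0.5.
        priority = float(url.findtext("sm:priority", default="0.5", namespaces=NS))
        if priority >= minimum:
            result.append(loc)
    return result

print(urls_with_min_priority(SAMPLE, 0.5))
```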
Use cases
sitemap.xml files are usually published by sites that want to be indexed by search engines. Sitemaps can be found for most:
e-commerce sites;
travel sites;
news sites;
yellow pages.
The best way to scrape a whole site is by using the Sitemap.xml link selector. It removes the need to deal with pagination, categories, and search forms/queries. Some sites don't display the category tree (breadcrumbs) if a page is opened directly. In these cases the site has to be traversed through category pages to scrape the category tree.
Making sure that only specific pages are scraped
Since in most cases sitemap.xml contains all pages of the site, it is possible to limit the scraper so that it scrapes only the pages that contain the required data. For example, an e-commerce site's sitemap.xml will contain product pages, category pages, and contact/about/etc. pages. To limit the scraper so that it scrapes only product pages, one or more of these methods can be used:
Using RegEx - if all product URLs contain a specific string that other page types don't, then this string can be set in the RegEx field and the scraper will traverse only pages that match it, for example /product/. This will prevent the scraper from traversing and scraping unnecessary pages.
Setting priority - some sites prioritize specific page types over others. If that is the case, setting a minimum priority will improve scraped page precision.
Using a wrapper Element selector - if neither of the previously mentioned methods is possible, a wrapper Element selector can be set up. This method works for all sites and doesn't return empty records in the result file if an invalid or unnecessary page is traversed. To set up the wrapper Element selector, follow these steps:
1. Open a few pages that need to be scraped.
2. Find an element that can be found only on these types of pages, for example a product title h1.product-title.
3. Create an Element selector and set it as a child selector of the Sitemap.xml link selector.
4. Set the Element selector to multiple and set its selector to (for example) body:has(h1.product-title).
5. Select the rest of the selectors as child selectors of this Element selector.
The key part of this method is that a unique element has to be found and included in the body:has(unique_selector) selector. If data from meta tags has to be scraped, the html tag can be used instead of the body tag. The scraper will extract data only from pages that have this unique element.
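The wrapper logic can be approximated outside the scraper with Python's standard library: a page is kept only if the unique element is present. The h1.product-title class below is just an illustrative assumption, not a selector any real site is guaranteed to use:

```python
from html.parser import HTMLParser

class UniqueElementDetector(HTMLParser):
    """Checks whether a page contains <h1 class="product-title">,
    mimicking the effect of a body:has(h1.product-title) wrapper selector."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if tag == "h1" and "product-title" in classes.split():
            self.found = True

def is_product_page(html):
    detector = UniqueElementDetector()
    detector.feed(html)
    return detector.found

print(is_product_page('<body><h1 class="product-title">Red shoes</h1></body>'))
```

Pages for which the check fails would simply produce no record, which is exactly why the wrapper method avoids empty rows in the result file.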
When using the Sitemap.xml link selector, set the main page of the site as the start URL.