
Sitemap.xml link selector

The Sitemap.xml link selector can be used similarly to the Link selector to get to target pages (for example, product pages). By using this selector, the whole site can be traversed without setting up selectors for pagination or other site navigation. The Sitemap.xml link selector extracts URLs from sitemap.xml files, which websites publish so that search engine crawlers can navigate the site more easily. In most cases, they contain all of the site's relevant page URLs.

Web Scraper supports the standard sitemap.xml format. The sitemap.xml file can also be compressed (sitemap.xml.gz). If a sitemap.xml contains URLs to other sitemap.xml files, the selector will work recursively to find all URLs in the sub-sitemap.xml files.
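The recursion works because the sitemap protocol distinguishes a sitemap index (whose <loc> entries point to other sitemap files) from a urlset (whose <loc> entries are page URLs). As a rough illustration of the traversal, here is a conceptual Python sketch, not Web Scraper's own code; the commented-out URL is a placeholder:

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def collect_sitemap_urls(sitemap_url):
    """Recursively collect page URLs from a sitemap.xml,
    following <sitemapindex> entries into sub-sitemap files."""
    data = urllib.request.urlopen(sitemap_url).read()
    if sitemap_url.endswith(".gz"):
        # compressed sitemaps (sitemap.xml.gz) are decompressed first
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    urls = []
    if root.tag == NS + "sitemapindex":
        # an index file: every <loc> points to another sitemap.xml
        for loc in root.iter(NS + "loc"):
            urls.extend(collect_sitemap_urls(loc.text.strip()))
    else:
        # a plain <urlset>: every <loc> is a page URL
        urls.extend(loc.text.strip() for loc in root.iter(NS + "loc"))
    return urls

# Hypothetical entry point; replace with a real sitemap.xml URL:
# print(collect_sitemap_urls("https://example.com/sitemap.xml"))
```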
Note! Web Scraper has a download size limit. If multiple sitemap.xml URLs are used, the scraping job might fail due to exceeding the limit. To work around this, try splitting the scraping job into multiple Web Scraper sitemaps, where each sitemap uses only one sitemap.xml URL.

Note! Sites that have sitemap.xml files are sometimes quite large. We recommend using Web Scraper Cloud for large-volume scraping.

Configuration options

sitemap.xml urls - a list of URLs of the site's sitemap.xml files. Multiple URLs can be added. By clicking "Add from robots.txt", Web Scraper will automatically add all sitemap.xml URLs that can be found in the site's robots.txt file. If no URLs are found, it is worth checking the /sitemap.xml URL, which might contain a sitemap.xml file that isn't listed in the robots.txt file.

found URL RegEx (optional) - a regular expression to match a substring of the URLs. If set, only URLs from the sitemap.xml that match the RegEx will be scraped.

minimum priority (optional) - the minimum priority of URLs to be scraped. Inspect the sitemap.xml file to decide whether this value should be filled.
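As a minimal sketch of what the two optional filters do (a conceptual Python illustration, not Web Scraper's implementation; the example URLs and priority values are made up):

```python
import re

def filter_entries(entries, url_regex=None, minimum_priority=None):
    """entries: (url, priority) pairs from a sitemap.xml's <url> records.
    Keeps a URL only if it matches the RegEx and meets the priority floor."""
    kept = []
    for url, priority in entries:
        if url_regex and not re.search(url_regex, url):
            continue  # 'found URL RegEx': drop URLs without the substring
        if minimum_priority is not None and priority < minimum_priority:
            continue  # 'minimum priority': drop low-priority URLs
        kept.append(url)
    return kept

entries = [
    ("https://example.com/product/red-shoe", 0.8),
    ("https://example.com/category/shoes", 0.4),
    ("https://example.com/about", 0.3),
]
# Only the product page survives both filters:
print(filter_entries(entries, url_regex=r"/product/", minimum_priority=0.5))
```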

Use cases

sitemap.xml files are usually used by sites that want to be indexed by search engines, so sitemaps can be found for most:

e-commerce sites;
travel sites;
news sites;
yellow pages.

The best way to scrape a whole site is by using the Sitemap.xml link selector. It removes the necessity of dealing with pagination, categories, and search forms/queries. Some sites don't display the category tree (breadcrumbs) if the page is opened directly; in these cases, the site has to be traversed through category pages to scrape the category tree.

Making sure that only specific pages are scraped

As in most cases the sitemap.xml contains all pages of the site, it is possible to limit the scraper so that it scrapes only the pages that contain the required data. For example, an e-commerce site's sitemap.xml will contain product pages, category pages, and contact/about/etc. pages. To limit the scraper so that it scrapes only product pages, one or more of these methods can be used:

Using RegEx - if all product URLs contain a specific string that other page types don't contain, this string can be set in the RegEx field, and the scraper will traverse only pages that match it. For example, /product/. This will prevent the scraper from traversing and scraping unnecessary pages.

Setting priority - some sites prioritize specific page types over others. If that is the case, setting the minimum priority will improve scraped page precision.

Using a wrapper element selector - if none of the previously mentioned methods are possible, a wrapper element selector can be set up. This method works for all sites and doesn't produce empty records in the result file if an invalid or unnecessary page is traversed. To set up the wrapper element selector, follow these steps:

1. Open a few pages that need to be scraped.
2. Find an element that can be found only on this type of page, for example a product title h1.product-title.
3. Create an element selector and set it as a child selector of the Sitemap.xml link selector.
4. Set the element selector to multiple and set its selector to (for example) body:has(h1.product-title).
5. Select the rest of the selectors as child selectors of this element selector.

The key part of this method is that a unique element has to be found and included in the body:has(unique_selector) selector. If data from meta tags has to be scraped, the html tag can be used instead of the body tag. The scraper will extract data only from pages that have this unique element.
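To illustrate why body:has(unique_selector) acts as a page filter, here is a small Python check using BeautifulSoup, whose soupsieve backend supports the :has() pseudo-class; h1.product-title is the hypothetical unique element from the steps above:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

product_page = "<html><body><h1 class='product-title'>Red Shoe</h1></body></html>"
contact_page = "<html><body><h1>Contact us</h1></body></html>"

for html in (product_page, contact_page):
    soup = BeautifulSoup(html, "html.parser")
    # body:has(h1.product-title) matches only pages containing the unique element
    if soup.select_one("body:has(h1.product-title)"):
        print("page matched - child selectors would extract data here")
    else:
        print("page skipped - no record is produced")
```

Pages without the unique element match nothing, which is why this method avoids empty records in the result file.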
When using the Sitemap.xml link selector, set the main page of the site as the start URL.
