Sitemap.xml link selector
The Sitemap.xml link selector can be used similarly to the Link selector to reach target pages (for example product pages). With this selector, the whole site can be traversed without setting up selectors for pagination or other site navigation. The Sitemap.xml link selector extracts URLs from sitemap.xml files, which websites publish so that search engine crawlers can navigate their sites more easily. In most cases they contain all of the site's relevant page URLs.
Web Scraper supports the standard sitemap.xml format. The sitemap.xml file can also be compressed (sitemap.xml.gz). If a sitemap.xml contains URLs to other sitemap.xml files, the selector will work recursively to find all URLs in the sub-sitemap.xml files.
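For illustration, a minimal sitemap index in the standard sitemaps.org format might look like this (the URLs are placeholders); each entry points to a sub-sitemap that the selector will fetch recursively:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A sitemap index: each <sitemap> entry points to a sub-sitemap file. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <!-- Sub-sitemaps may also be gzip-compressed. -->
    <loc>https://example.com/sitemap-categories.xml.gz</loc>
  </sitemap>
</sitemapindex>
```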
Note! Web Scraper has a download size limit. If multiple sitemap.xml URLs are used, the scraping job might fail due to exceeding the limit. To work around this, try splitting the sitemap into multiple sitemaps, where each sitemap has only one sitemap.xml URL.
Note! Sites that have sitemap.xml files are sometimes quite large. We recommend using Web Scraper Cloud for large-volume scraping.
Configuration options
sitemap.xml urls - a list of URLs of the site's sitemap.xml files. Multiple URLs can be added. By clicking "Add from robots.txt", Web Scraper will automatically add all sitemap.xml URLs that can be found in the site's robots.txt file. If no URLs are found, it is worth checking the /sitemap.xml URL, which might contain a sitemap.xml file that isn't listed in the robots.txt file.
found URL RegEx (optional) - a regular expression to match a substring of the URLs. If set, only URLs from the sitemap.xml that match the RegEx will be scraped.
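As a rough illustration of how such a RegEx narrows down the URL list (the URLs and the /product/ pattern below are made-up examples):

```python
import re

# Hypothetical URLs as they might appear in a site's sitemap.xml file.
urls = [
    "https://example.com/product/red-shoes",
    "https://example.com/category/footwear",
    "https://example.com/product/blue-hat",
    "https://example.com/about",
]

# Keep only URLs that match the RegEx; the rest are never traversed.
pattern = re.compile(r"/product/")
product_urls = [u for u in urls if pattern.search(u)]
print(product_urls)
```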
minimum priority (optional) - the minimum priority of URLs to be scraped. Inspect the sitemap.xml file to decide whether this value should be filled.
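To see what this option does with the priority values a sitemap carries, here is a small stand-alone sketch (the sample sitemap is invented) that parses a sitemap and keeps only URLs at or above a minimum priority:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product/1</loc><priority>0.8</priority></url>
  <url><loc>https://example.com/about</loc><priority>0.3</priority></url>
</urlset>"""

def urls_with_min_priority(xml_text, minimum):
    """Return sitemap URLs whose <priority> is at or above `minimum`."""
    root = ET.fromstring(xml_text)
    result = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        # Per the sitemaps.org protocol, a missing <priority> defaults to 0.5.
        priority = float(url.findtext("sm:priority", default="0.5", namespaces=NS))
        if priority >= minimum:
            result.append(loc)
    return result

print(urls_with_min_priority(SAMPLE, 0.5))
```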
Use cases
sitemap.xml files are usually published by sites that want to be indexed by search engines. Sitemaps can be found for most:
e-commerce sites;
travel sites;
news sites;
yellow pages.
The best way to scrape a whole site is by using the Sitemap.xml link selector. It removes the need to deal with pagination, categories, and search forms/queries. Some sites don't display the category tree (breadcrumbs) if a page is opened directly. In these cases the site has to be traversed through category pages to scrape the category tree.
Making sure that only specific pages are scraped
Since in most cases sitemap.xml contains all pages of the site, it is possible to limit the scraper so that it scrapes only the pages that contain the required data. For example, an e-commerce site's sitemap.xml will contain product pages, category pages, and contact/about/etc. pages. To limit the scraper so that it scrapes only product pages, one or more of these methods can be used:
Using RegEx - if all product URLs contain a specific string that other page types don't, then this string can be set in the RegEx field and the scraper will traverse only pages that match it, for example /product/. This will prevent the scraper from traversing and scraping unnecessary pages.
Setting priority - some sites prioritize specific page types over others. If that is the case, setting a minimum priority will improve scraped page precision.
Using a wrapper Element selector - if neither of the previously mentioned methods is possible, a wrapper Element selector can be set up. This method works for all sites and doesn't return empty records in the result file if an invalid or unnecessary page is traversed. To set up the wrapper Element selector, follow these steps:
1. Open a few pages that need to be scraped.
2. Find an element that can be found only on these types of pages, for example a product title h1.product-title.
3. Create an Element selector and set it as a child selector of the Sitemap.xml link selector.
4. Set the Element selector to multiple and set its selector to (for example) body:has(h1.product-title).
5. Select the rest of the selectors as child selectors of this Element selector.
The key part of this method is that a unique element has to be found and included in the body:has(unique_selector) selector. If data from meta tags has to be scraped, the html tag can be used instead of the body tag. The scraper will extract data only from pages that have this unique element.
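The wrapper logic can be approximated outside the scraper with Python's standard library: a page is kept only if the unique element is present. The h1.product-title class below is just an illustrative assumption, not a selector any real site is guaranteed to use:

```python
from html.parser import HTMLParser

class UniqueElementDetector(HTMLParser):
    """Checks whether a page contains <h1 class="product-title">,
    mimicking the effect of a body:has(h1.product-title) wrapper selector."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if tag == "h1" and "product-title" in classes.split():
            self.found = True

def is_product_page(html):
    detector = UniqueElementDetector()
    detector.feed(html)
    return detector.found

print(is_product_page('<body><h1 class="product-title">Red shoes</h1></body>'))
```

Pages for which the check fails would simply produce no record, which is exactly why the wrapper method avoids empty rows in the result file.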
When using the Sitemap.xml link selector, set the main page of the site as the start URL.