
Web Scraping Cheat Sheet
BS4 | Selenium | Scrapy
Artificial Corner
Web Scraping
Web Scraping is the process of extracting data from a website. Before studying Beautiful Soup and Selenium, it's good to review some HTML basics first.
HTML for Web Scraping
Let's take a look at the HTML element syntax.

<h1 class="title"> Titanic (1997) </h1>

Here h1 is the tag name, class is the attribute name, "title" is the attribute value, and </h1> is the end tag; "Titanic (1997)" is the content affected by the attribute. Together they make up an HTML element.
This is a single HTML element, but the HTML code behind a website has hundreds of them.

HTML code example
<article class="main-article">
  <h1> Titanic (1997) </h1>
  <p class="plot"> 84 years later ... </p>
  <div class="full-script"> 13 meters. You ... </div>
</article>
The HTML code is structured with "nodes" (element, attribute, and text nodes). In the example above, the root element <article> is the parent node; the elements <h1>, <p>, and <div> are its children, and "siblings" are nodes with the same parent. Each element can carry an attribute node (class="main-article", class="plot", class="full-script") and a text node ("Titanic (1997)", "84 years later ...", "13 meters. You ...").
Beautiful Soup Workflow

Importing the libraries
from bs4 import BeautifulSoup
import requests

Fetch the page
result = requests.get("https://www.google.com")
result.status_code  # get the status code
result.headers  # get the headers

Page content
content = result.text

Create soup
soup = BeautifulSoup(content, "lxml")

HTML in a readable format
print(soup.prettify())

Find an element
soup.find(id="specific_id")

Find elements
soup.find_all("a")
soup.find_all("a", "css_class")
soup.find_all("a", class_="my_class")
soup.find_all("a", attrs={"class": "my_class"})

Get inner text
sample = element.get_text()
sample = element.get_text(strip=True, separator=' ')

Get specific attributes
sample = element.get('href')
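Putting the workflow together, here is a minimal, self-contained sketch that parses the sample Titanic HTML from this sheet (stored in a string, so no request is needed; on a real page you would pass result.text instead):

from bs4 import BeautifulSoup

html_doc = """
<article class="main-article">
<h1> Titanic (1997) </h1>
<p class="plot"> 84 years later ... </p>
<div class="full-script"> 13 meters. You ... </div>
</article>
"""

soup = BeautifulSoup(html_doc, "lxml")
title = soup.find("h1").get_text(strip=True)               # 'Titanic (1997)'
plot = soup.find("p", class_="plot").get_text(strip=True)  # '84 years later ...'
transcript = soup.find("div", class_="full-script").get_text(strip=True)
print(title, plot, transcript, sep="\n")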
XPath
We need to learn XPath to scrape with Selenium or Scrapy. It's recommended for beginners to use IDs to find elements, and if there isn't any, to build an XPath.

XPath Syntax
An XPath usually contains a tag name, attribute name, and attribute value.

//tagName[@AttributeName="Value"]

Let's check some examples to locate the article, title, and transcript elements of the HTML code we used before.

//article[@class="main-article"]
//h1
//div[@class="full-script"]

XPath Functions and Operators

XPath functions
//tag[contains(@AttributeName, "Value")]

XPath Operators: and, or
//tag[(expression 1) and (expression 2)]
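These expressions can be tested without a browser. Below is a minimal sketch using the lxml library (the same package that provides the "lxml" parser used with Beautiful Soup; it is not otherwise covered in this sheet) against the sample Titanic HTML:

from lxml import html

html_doc = """
<article class="main-article">
<h1> Titanic (1997) </h1>
<p class="plot"> 84 years later ... </p>
<div class="full-script"> 13 meters. You ... </div>
</article>
"""

tree = html.fromstring(html_doc)
print(tree.xpath('//h1/text()'))                           # [' Titanic (1997) ']
print(tree.xpath('//div[@class="full-script"]/text()'))    # the transcript text
print(tree.xpath('//p[contains(@class, "plot")]/text()'))  # contains() function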
XPath Special Characters
/    Selects the children from the node set on the left side of this character
//   Specifies that the matching node set should be located at any level within the document
.    Specifies that the current context should be used (refers to the present node)
..   Refers to a parent node
*    A wildcard character that selects all elements or attributes regardless of names
@    Selects an attribute
()   Groups an XPath expression
[n]  Indicates that a node with index "n" should be selected
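Applied to the sample HTML above, each special character looks like this (illustrative XPath only; the expressions can be tested with the lxml snippet shown earlier):

/article/h1     # / selects direct children: the <h1> inside <article>
//div           # // matches <div> nodes at any level of the document
//h1/..         # .. moves up to the parent node (<article>)
//article/*     # * selects all child elements regardless of their names
//p/@class      # @ selects the class attribute of <p> ("plot")
(//div)[1]      # () groups the expression; [1] selects the first match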
Medium Guides/YouTube Tutorials
Here are my guides/tutorials and courses:
- Web Scraping Course
- Data Science Course
- Automation Course
- Make Money Using Programming Skills

Made by Frank Andrade: artificialcorner.com
Selenium 4
Note that there are a few changes between Selenium 3.x and Selenium 4.

Import libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

web = "https://www.google.com"
path = 'introduce chromedriver path'
service = Service(executable_path=path)  # selenium 4
driver = webdriver.Chrome(service=service)  # selenium 4
driver.get(web)

Note:
driver = webdriver.Chrome(path)  # selenium 3.x

Find an element
driver.find_element(by="id", value="...")  # selenium 4
driver.find_element_by_id("write-id-here")  # selenium 3.x

Find elements
driver.find_elements(by="xpath", value="...")  # selenium 4
driver.find_elements_by_xpath("write-xpath-here")  # selenium 3.x

Quit driver
driver.quit()

Getting the text
data = element.text

Implicit Waits
Note that time.sleep() is a hard pause, not a true implicit wait; Selenium's built-in implicit wait is set once per driver with implicitly_wait().
import time
time.sleep(2)  # hard pause for 2 seconds
driver.implicitly_wait(2)  # wait up to 2 seconds when locating elements

Explicit Waits
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, 'id_name')))
# Wait 5 seconds until an element is clickable

Options: Headless mode, change window size
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True  # newer Selenium releases use options.add_argument('--headless')
options.add_argument('window-size=1920x1080')
driver = webdriver.Chrome(service=service, options=options)
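Putting the Selenium 4 calls above together, here is a minimal end-to-end sketch; https://example.com is a placeholder page whose only <h1> holds the text we print:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

service = Service(executable_path='introduce chromedriver path')
driver = webdriver.Chrome(service=service)  # selenium 4
driver.get("https://example.com")  # placeholder URL

# wait up to 5 seconds for the first <h1>, then read its text
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))
)
print(element.text)

driver.quit()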
Scrapy
Scrapy is the most powerful web scraping framework in Python, but it's a bit complicated to set up, so check my guide or its documentation to set it up.

Creating a Project and Spider
To create a new project, run the following command in the terminal.
scrapy startproject my_first_spider
To create a new spider, first change the directory.
cd my_first_spider
Create a spider
scrapy genspider example example.com

The Basic Template
When you create a spider, you obtain a template with the following content.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass

The class is built with the data we introduced in the previous command, but the parse method needs to be built by us. To build it, use the functions below.

Finding elements
To find elements in Scrapy, use the response argument from the parse method.
response.xpath('//tag[@AttributeName="Value"]')

Getting the text
To obtain the text of an element, use text() and either .get() or .getall(). For example:
response.xpath('//h1/text()').get()
response.xpath('//tag[@AttributeName="Value"]/text()').getall()

Return data extracted
To see the data extracted, we have to use the yield keyword.

def parse(self, response):
    title = response.xpath('//h1/text()').get()
    # Return data extracted
    yield {'titles': title}
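Filling in the template, a complete minimal spider might look like this (a sketch: example.com and the //h1 XPath are illustrative placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # extract the text of every <h1> on the page
        for title in response.xpath('//h1/text()').getall():
            yield {'title': title}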
Run the spider and export data to CSV or JSON
scrapy crawl example
scrapy crawl example -o name_of_file.csv
scrapy crawl example -o name_of_file.json
