
STARTING YOUR NEXT SCRAPING PROJECT
This command creates your project scaffold, including the files needed
to support your Scrapy project:

scrapy startproject projectname

projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

To define items, add them to the items.py file and then use them in
your spider:

import scrapy
from scrapy.item import Item, Field

class CustomItem(Item):
    one_field = Field()
    another_field = Field()

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
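
Once defined, an item can be populated and yielded from a spider
callback. A minimal sketch, assuming hypothetical CSS selectors for the
target page:

def parse(self, response):
    product = Product()
    product['name'] = response.css('h1::text').get()
    product['price'] = response.css('.price::text').get()
    product['stock'] = response.css('.stock::text').get()
    yield product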

One of Scrapy’s main advantages is the use of selectors such as:

response.selector.xpath('//span/text()').get()  # full form
response.xpath('//span/text()').get()           # shortcut for .selector.xpath
response.css('span::text').get()                # CSS shortcut
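
These return the first match; .getall() returns every match as a list,
and ::attr() pulls attribute values. For example:

response.xpath('//span/text()').getall()  # every matching text node
response.css('a::attr(href)').getall()    # every href value on the page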
Scrapy tool global commands:

startproject, genspider, settings, runspider, shell, fetch, view, version

You can obtain more info for each with: scrapy <command> -h
or see available commands with: scrapy -h
Launch the Scrapy shell with: scrapy shell <url>
Once the shell loads, the response object is available and you can use
CSS and XPath selectors on it, such as:

response.xpath('//title/text()')

A typical spider workflow (see the sketch below):

Create the spider class
Assign the URL(s)
Parse the response/parameters
Assign items, pipelines, etc.
Debug in the Scrapy shell if needed
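
A minimal spider sketch covering those steps; the name, start URL, and
selectors below are illustrative (quotes.toscrape.com is a public
practice site):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                               # create the class
    start_urls = ['https://quotes.toscrape.com']  # assign the URL

    def parse(self, response):                    # parse the response
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}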

Scrapy uses Python's standard logging for event logging. We'll provide
some simple examples to get you started, but for more advanced
use-cases it's strongly suggested to read its documentation thoroughly.

import logging
logging.warning("This is a warning")
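
For more control, the Scrapy docs show how to replace Scrapy's default
log handler with your own configuration:

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)  # drop Scrapy's default handler
logging.basicConfig(
    filename='log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO,
)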

Email:

from scrapy.mail import MailSender

mailer = MailSender()
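
A sketch of sending a message (the addresses are placeholders); per the
Scrapy docs, a MailSender built with MailSender.from_settings(settings)
will pick up your project's mail settings:

mailer.send(
    to=['someone@example.com'],
    subject='Some subject',
    body='Some body',
    cc=['another@example.com'],
)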
https://docs.scrapy.org/en/latest/
https://docs.scrapy.org/en/latest/topics/items.html
https://docs.scrapy.org/en/latest/topics/logging.html

BEAUTIFULSOUP

How to view the document:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')  # or any other parser
print(soup.prettify())

Work through the data structure by selecting elements such as:

soup.title
soup.p
soup.title.name
soup.find_all(class_='classname')  # find tags by CSS class
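
A quick sketch of what those return, using a tiny hypothetical
document:

from bs4 import BeautifulSoup

html_doc = "<html><head><title>Demo</title></head><body><p class='intro'>Hi</p></body></html>"
soup = BeautifulSoup(html_doc, 'lxml')

soup.title                     # <title>Demo</title>
soup.title.name                # 'title'
soup.p                         # <p class="intro">Hi</p>
soup.find_all(class_='intro')  # [<p class="intro">Hi</p>]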

One common task is extracting all the URLs found within a page’s <a>
tags:

for link in soup.find_all('a'):
    print(link.get('href'))

Another common task is extracting all the text from a page:

print(soup.get_text())

Send a request to a URL and pass the response to BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.com'
headers = {}  # add header info if needed
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
Requests

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()

https://requests.readthedocs.io/en/master/

SELENIUM

Element locator options (these methods return the first matching element):

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

To find multiple elements (these methods will return a list):

find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

Apart from the public methods given above, there are two private
methods which might be useful with locators in page objects:
find_element and find_elements.
Example usage:

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')

Or:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located

# This example requires Selenium WebDriver 3.13 or newer
with webdriver.Firefox() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get("https://google.com/ncr")
    driver.find_element(By.NAME, "q").send_keys("cheese" + Keys.RETURN)
    first_result = wait.until(presence_of_element_located((By.CSS_SELECTOR, "h3>div")))
    print(first_result.get_attribute("textContent"))

Specify options with Selenium and Chrome, such as a page load strategy:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options=options)
# Navigate to url
driver.get("http://www.google.com")
driver.quit()
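
Besides 'normal', page_load_strategy also accepts 'eager' (wait only
for DOMContentLoaded) and 'none' (return as soon as the initial page is
received):

options.page_load_strategy = 'eager'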

We can also use the time module to add a delay or allow content time to
load:

import time

time.sleep(10)  # specify the number of seconds as the argument


Selenium's WebDriverWait:
time.sleep() is a simple approach, but Selenium also supports explicit
waits. For example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def document_initialised(driver):
    return driver.execute_script("return initialised")

driver.get("localfileexample.html")
WebDriverWait(driver, timeout=10).until(document_initialised)
el = driver.find_element(By.TAG_NAME, "p")
assert el.text == "Hellooooo Scraping"

https://www.selenium.dev/documentation/en/webdriver/waits/
