
STARTING YOUR NEXT SCRAPING PROJECT
This command creates your project scaffold, including the files needed
to support your Scrapy project:

scrapy startproject projectname

projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

To define items, add them to the items.py file and then use them in
your spider:

import scrapy
from scrapy.item import Item, Field

class CustomItem(Item):
    one_field = Field()
    another_field = Field()

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
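
Once defined, an item can be populated and yielded from a spider
callback. A minimal sketch, assuming hypothetical CSS selectors for the
target page:

def parse(self, response):
    product = Product()
    product['name'] = response.css('h1::text').get()
    product['price'] = response.css('.price::text').get()
    product['stock'] = response.css('.stock::text').get()
    yield product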

One of Scrapy’s main advantages is the use of selectors such as:

response.selector.xpath('//span/text()').get()  # full form
response.xpath('//span/text()').get()           # shortcut for .selector.xpath
response.css('span::text').get()                # CSS shortcut
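
These return the first match; .getall() returns every match as a list,
and ::attr() pulls attribute values. For example:

response.xpath('//span/text()').getall()  # every matching text node
response.css('a::attr(href)').getall()    # every href value on the page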
Scrapy tool global commands:

startproject, genspider, settings, runspider, shell, fetch, view, version

You can obtain more info for each with: scrapy <command> -h
or see available commands with: scrapy -h
Launch the Scrapy shell with: scrapy shell <url>
Once the shell loads, the response object is available and you can use
CSS and XPath selectors on it, such as:

response.xpath('//title/text()')

A typical spider workflow (see the sketch below):

Create the spider class
Assign the URL(s)
Parse the response/parameters
Assign items, pipelines, etc.
Debug in the Scrapy shell if needed
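
A minimal spider sketch covering those steps; the name, start URL, and
selectors below are illustrative (quotes.toscrape.com is a public
practice site):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                               # create the class
    start_urls = ['https://quotes.toscrape.com']  # assign the URL

    def parse(self, response):                    # parse the response
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}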

Scrapy uses Python's standard logging for event logging. We'll provide
some simple examples to get you started, but for more advanced
use-cases it's strongly suggested to read its documentation thoroughly.

import logging
logging.warning("This is a warning")
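
For more control, the Scrapy docs show how to replace Scrapy's default
log handler with your own configuration:

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)  # drop Scrapy's default handler
logging.basicConfig(
    filename='log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO,
)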

Email:

from scrapy.mail import MailSender

mailer = MailSender()
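
A sketch of sending a message (the addresses are placeholders); per the
Scrapy docs, a MailSender built with MailSender.from_settings(settings)
will pick up your project's mail settings:

mailer.send(
    to=['someone@example.com'],
    subject='Some subject',
    body='Some body',
    cc=['another@example.com'],
)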
https://docs.scrapy.org/en/latest/
https://docs.scrapy.org/en/latest/topics/items.html
https://docs.scrapy.org/en/latest/topics/logging.html

BEAUTIFULSOUP

How to view the document:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')  # or any other parser
print(soup.prettify())

Work through the data structure by selecting elements such as:

soup.title
soup.p
soup.title.name
soup.find_all(class_='classname')  # find tags by CSS class
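
A quick sketch of what those return, using a tiny hypothetical
document:

from bs4 import BeautifulSoup

html_doc = "<html><head><title>Demo</title></head><body><p class='intro'>Hi</p></body></html>"
soup = BeautifulSoup(html_doc, 'lxml')

soup.title                     # <title>Demo</title>
soup.title.name                # 'title'
soup.p                         # <p class="intro">Hi</p>
soup.find_all(class_='intro')  # [<p class="intro">Hi</p>]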

One common task is extracting all the URLs found within a page’s <a>
tags:

for link in soup.find_all('a'):
    print(link.get('href'))

Another common task is extracting all the text from a page:

print(soup.get_text())

Send a request to a URL and pass the response to BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.com'
headers = {}  # add header info if needed
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
Requests

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()

https://requests.readthedocs.io/en/master/

SELENIUM

Element locator options (these methods return the first matching element):

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

To find multiple elements (these methods will return a list):

find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

Apart from the public methods given above, there are two private
methods which might be useful with locators in page objects:
find_element and find_elements.
Example usage:

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')

Or:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located

# This example requires Selenium WebDriver 3.13 or newer
with webdriver.Firefox() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get("https://google.com/ncr")
    driver.find_element(By.NAME, "q").send_keys("cheese" + Keys.RETURN)
    first_result = wait.until(presence_of_element_located((By.CSS_SELECTOR, "h3>div")))
    print(first_result.get_attribute("textContent"))

Specify options with Selenium and Chrome, such as a page load strategy:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options=options)
# Navigate to url
driver.get("http://www.google.com")
driver.quit()
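
Besides 'normal', page_load_strategy also accepts 'eager' (wait only
for DOMContentLoaded) and 'none' (return as soon as the initial page is
received):

options.page_load_strategy = 'eager'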

We can also use the time module to add a delay or allow content time to
load:

import time

time.sleep(10)  # specify the number of seconds as the argument


Selenium's WebDriverWait:
time.sleep() is a simple approach, but Selenium also supports explicit
waits. For example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def document_initialised(driver):
    return driver.execute_script("return initialised")

driver.get("localfileexample.html")
WebDriverWait(driver, timeout=10).until(document_initialised)
el = driver.find_element(By.TAG_NAME, "p")
assert el.text == "Hellooooo Scraping"

https://www.selenium.dev/documentation/en/webdriver/waits/
