SCRAPY

------
Sources:
** VIDEO TUTORIAL: http://www.youtube.com/watch?v=1EFnX1UkXVU
** SCRAPY TUTORIAL: http://doc.scrapy.org/en/latest/intro/tutorial.html
** XPATH TUTORIAL/INFO: http://www.w3schools.com/xpath/
** INSTALLATION: http://doc.scrapy.org/en/latest/intro/install.html#intro-install
        Note: (important) Install the dependencies that match your OS (64-bit) and Python version.
--------------------------------------------------------------------------------
| STARTING A SCRAPY PROJECT                                                    |
--------------------------------------------------------------------------------
** Step 1: Go to the command prompt.
** Step 2: Change to the directory where you want to store the project.
        e.g. cd desktop                     # current directory is now the desktop
** Step 3: Type in: scrapy startproject your_project_name (to create a new Scrapy project)
        e.g. scrapy startproject mlim
        Note: A folder with the same name as your_project_name should appear in the current working directory.
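The generated folder follows Scrapy's default project template; for the mlim example above it should look roughly like this (file names come from the template, not from these notes):

```
mlim/
    scrapy.cfg          # project configuration file
    mlim/               # the project's Python package
        __init__.py
        items.py        # item definitions (see the next section)
        pipelines.py
        settings.py
        spiders/        # spider modules go here
            __init__.py
```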
--------------------------------------------------------------------------------
| SCRAPING INFORMATION FROM WEBSITES VIA XPATH                                 |
--------------------------------------------------------------------------------
** Step 1: Open items.py
        Step 1.1: Add to the class the fields you want to obtain, then save the file.
        e.g. link = Field(), etc. (remove the pass statement)
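For the Craigslist example used below, items.py might end up looking like this (a sketch using the old-style scrapy.item imports that match the spider code in these notes):

```python
# items.py -- defines the fields the spider will fill in
from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()   # text of the posting's link
    link = Field()    # href of the posting's link
```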
** Step 2: Create the spider using the BaseSpider class (the simplest form of a spider).
        Step 2.1: Open a new Python file and save it (file extension .py) in the spiders folder.
Sample code (dependent on XPath):
----------------------------------------- CODE ---------------------------------


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"                                          # name of spider
    allowed_domains = ["craigslist.org"]                    # "homepage" or main domain
    start_urls = ["http://sfbay.craigslist.org/sfc/npo/"]   # link of the page to be parsed

    def parse(self, response):                              # parsing function
        hxs = HtmlXPathSelector(response)
        # main body xpath (where the information to be parsed is grouped)
        titles = hxs.select("//div[@id='toc_rows']/div[2]/p/span[2]")
        items = []                  # list where each article's information is stored
        for title in titles:
            item = CraigslistSampleItem()   # see items.py (item object contains link and title)
            item["title"] = title.select("a/text()").extract()  # title-of-link xpath (from inspection)
            item["link"] = title.select("a/@href").extract()    # link xpath; @href accesses the attribute value
            items.append(item)              # append the item to the list
        return items
--------------------------------------------------------------------------------
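The XPath expressions used above can be tried on their own with the standard library's limited XPath support; the snippet below runs the same path against a small made-up piece of markup (not real Craigslist HTML):

```python
import xml.etree.ElementTree as ET

# Made-up markup mimicking the structure the spider's XPath expects.
page = """
<html>
  <div id="toc_rows">
    <div>header row</div>
    <div>
      <p><span>Apr 1</span><span><a href="/sfc/npo/123.html">First posting</a></span></p>
      <p><span>Apr 2</span><span><a href="/sfc/npo/456.html">Second posting</a></span></p>
    </div>
  </div>
</html>
"""

root = ET.fromstring(page)
items = []
# same path as the spider: the second span of each listing row
for span in root.findall(".//div[@id='toc_rows']/div[2]/p/span[2]"):
    a = span.find("a")                      # the anchor inside the span
    items.append({"title": a.text,          # equivalent of a/text()
                  "link": a.get("href")})   # equivalent of a/@href
```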
** Step 3: Save the file and go back to the command prompt.
** Step 4: To run the code, first change the directory to your project folder.
        Type in: cd your_project_name
** Step 5: Crawl the website by running the spider.
        Type in: scrapy crawl spider_name (using the name attribute of the class you defined)
        e.g. scrapy crawl craig
** Step 6: To save the parsed information, type in: scrapy crawl spider_name -o filename.csv -t csv
        e.g. scrapy crawl craig -o items.csv -t csv      # file is created in the project directory
--------------------------------------------------------------------------------
| SCRAPING INFORMATION FROM WEBSITES USING BeautifulSoup                       |
--------------------------------------------------------------------------------
Source: https://gist.github.com/davepeck/790721
Sample Code:
import re
from scrapy.link import Link
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

class SoupLinkExtractor(object):
    def __init__(self, *args, **kwargs):
        super(SoupLinkExtractor, self).__init__()
        # optional regex: only URLs matching 'allow' will be followed
        allow_re = kwargs.get('allow', None)
        self._allow = re.compile(allow_re) if allow_re else None

    def extract_links(self, response):
        raw_follow_urls = []

        # collect every anchor's href, skipping in-page fragments like '#top'
        soup = BeautifulSoup(response.body_as_unicode())
        anchors = soup.findAll('a')
        for anchor in anchors:
            anchor_href = anchor.get('href', None)
            if anchor_href and not anchor_href.startswith('#'):
                raw_follow_urls.append(anchor_href)

        # resolve relative hrefs against the page's own URL
        potential_follow_urls = [urljoin(response.url, raw_follow_url)
                                 for raw_follow_url in raw_follow_urls]

        # keep only URLs matching the 'allow' pattern, if one was given
        if self._allow:
            follow_urls = [potential_follow_url
                           for potential_follow_url in potential_follow_urls
                           if self._allow.search(potential_follow_url) is not None]
        else:
            follow_urls = potential_follow_urls

        return [Link(url=follow_url) for follow_url in follow_urls]
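The core of extract_links (drop fragments, absolutize every href, then filter with the optional allow pattern) can be seen in isolation with the standard library alone; a Python 3 sketch on made-up hrefs (the class above is from the Python 2 / BeautifulSoup 3 era):

```python
import re
from urllib.parse import urljoin  # Python 3 home of the urljoin used above

# Hypothetical hrefs pulled from anchor tags, plus the page they came from.
page_url = "http://sfbay.craigslist.org/sfc/npo/"
raw_hrefs = ["123.html", "/about/", "#top", "http://example.com/other"]

allow = re.compile(r"craigslist\.org")

# Same pipeline as extract_links: skip fragments, resolve, then filter.
candidates = [urljoin(page_url, h) for h in raw_hrefs if not h.startswith("#")]
follow_urls = [u for u in candidates if allow.search(u)]
```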