
Web Scraping

From http://scrapy.org/
Scrapy at a glance
Scrapy is an application framework for
crawling websites and extracting structured
data, which can be used for a wide range of
useful applications, like data mining,
information processing, or historical archival.
It can also be used to extract data via APIs.
Scrapy is written in Python.
pip install scrapy
Suppose you need to extract some information from a
website, but the website doesn't provide any
API or mechanism to access that information
programmatically.
Scrapy can help you extract that information.
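Creating a project
A project is created with the startproject command (the project name tutorial below is an assumption here; it is the name used in the official Scrapy tutorial):

scrapy startproject tutorial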
This generates a project directory with the following contents:

tutorial/
    scrapy.cfg          # the project configuration file
    tutorial/           # the project's Python module; you'll later import your code from here
        __init__.py
        items.py        # the project's items file
        pipelines.py    # the project's pipelines file
        settings.py     # the project's settings file
        spiders/        # a directory where you'll later put your spiders
            __init__.py
Defining our Item
Items are containers that will be loaded with
the scraped data.
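A minimal Item sketch, using the fields that appear in the JSON output at the end of these notes (the class name Website is an assumption):

from scrapy.item import Item, Field

class Website(Item):
    # one Field per piece of data we want to scrape
    url = Field()
    name = Field()
    description = Field()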
Our first Spider
Spiders are user-written classes used to scrape information from a
domain.
Three main mandatory attributes (see the sketch below):
name
start_urls
parse()
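A minimal spider sketch in the pre-1.0 Scrapy style these notes use (BaseSpider was the base class in Scrapy 0.16; the start URL is the one from the official tutorial, matching the dmoz crawl shown later):

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    # name identifies the spider: `scrapy crawl dmoz` runs it
    name = "dmoz"
    # the first URLs the spider will download
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # called with the response downloaded for each start URL;
        # here it just saves the page body to a local file
        filename = response.url.split("/")[-2]
        open(filename, "wb").write(response.body)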
Extracting Items
There are several ways to extract data from web pages
Here are some examples of XPath expressions and their meanings:

/html/head/title: selects the <title> element inside the <head> element of an HTML
document
/html/head/title/text(): selects the text inside the aforementioned <title> element
//td: selects all the <td> elements
//div[@class="mine"]: selects all <div> elements whose class attribute is
"mine"
Selectors have three methods:

select(): returns a list of selectors, each of them representing the nodes selected
by the XPath expression given as argument.
extract(): returns a unicode string with the data selected by the XPath selector.
re(): returns a list of unicode strings extracted by applying the regular
expression given as argument.
Extracting the data
from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)        # wrap the response in a selector
hxs.select('//ul/li')                    # the <li> elements, one per site
hxs.select('//ul/li/text()').extract()   # description
hxs.select('//ul/li/a/text()').extract() # title
hxs.select('//ul/li/a/@href').extract()  # links
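Putting the selectors and the Item together, a parse() method in the same pre-1.0 style could look like this (Website and its fields are the assumed names from the Item sketch above):

from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []
    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('a/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)
    # returned items are what the feed exporter writes out
    return items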
Crawling
scrapy crawl dmoz
2013-05-06 12:08:02+0700 [scrapy] INFO: Scrapy 0.16.4 started (bot: scrapybot)
2013-05-06 12:08:03+0700 [scrapy] DEBUG: Enabled extensions: FeedExporter,
LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-05-06 12:08:03+0700 [scrapy] DEBUG: Enabled downloader middlewares:
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware,
RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware,
CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware,
DownloaderStats
2013-05-06 12:08:03+0700 [scrapy] DEBUG: Enabled spider middlewares:
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware,
DepthMiddleware
Storing the scraped data
scrapy crawl dmoz -o items.json -t json

[{"url": ["http://www.network-theory.co.uk/python/intro/"],
"name": ["An Introduction to Python"],
"description": ["By Guido van Rossum, Fred L. Drake, Jr.;
Network Theory Ltd., 2003, ISBN 0954161769. Printed edition of official tutorial,
for v2.x, from Python.org. [Network Theory, online]"]},
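The -o flag names the output file and -t the feed format; Scrapy's feed exports also support other formats such as CSV and XML, e.g.:

scrapy crawl dmoz -o items.csv -t csv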
Other language?
Just search Google for "scraping with" plus your language of choice :D
