Open Data Cook Book Recipes (Draft 0.1) Scraper Wiki

A how-to on using Scraper Wiki, prepared as a first draft for the OpenDataCookBook.net project.

Published by Tim Davies on Dec 02, 2010
Copyright: Attribution Non-commercial

Recipes: Scraper Wiki
ScraperWiki is a service that helps you gather data from websites that do not provide it as raw data. ScraperWiki provides a programming environment where you can write and share a scraper from your browser. ScraperWiki will run your scraper for you once a day, and will make the results available to download and through an Application Programming Interface (API) for other web programs to use as well.
You will need:

- An account at www.scraperwiki.com (free)
- Some programming experience
- A website with structured information on it that you want to scrape
1) Explore the structure of the website you are planning to scrape

In this example I'm looking at the location of Garages to Rent in Oxford City. First, when viewing the page, I check that the elements I want to scrape are presented fairly uniformly (e.g. there is always the same title for the same thing), as lots of variation in the way similar things are presented makes for difficult scraping.

Secondly, I take a look at the source code of the web page to explore whether each 'field' I want to scrape (e.g. Postcode; Picture etc.) is contained neatly in its own HTML element. In this case, whilst each listing is in a <div> HTML element, a lot of the rest of the text is only separated by line-breaks.

I've used the FireBug plugin for the Firefox web browser to look at the structure of the page, as it allows me to explore in more detail than the standard 'View Source' feature in most browsers.
2) Create a new Scraper on Scraper Wiki

I'm going to be creating a PHP scraper, as this is the programming language I'm most comfortable with, but you can also create scrapers using the Python and Ruby languages. The PHP Startup Scraper will load with some basic code already in place for fetching a webpage and starting to parse it. It makes use of the simple_html_dom library, which allows you to access elements of web pages using simple selectors.

Change the default URL so ScraperWiki is fetching the page you are interested in. Then also change the line 'foreach ($dom->find('td') as $data)', using a selector identified in your earlier exploration, to see if you can pick out the elements you want to scrape. For example, each of the listings of Garages for Rent in Oxford is contained within a div with the class 'pagewidget', so I can use the selector $dom->find('div.pagewidget') to locate them. (This sort of selector will be familiar to anyone used to working with CSS - Cascading Style Sheets.)
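The selector step can be sketched in plain PHP. The simple_html_dom library only ships inside ScraperWiki, so this sketch uses PHP's built-in DOMDocument and DOMXPath instead to show the same idea; inside ScraperWiki, $dom->find('div.pagewidget') does the equivalent. The HTML fragment below is made-up sample data shaped like the listings described.

```php
<?php
// Selecting listing elements by class, using PHP's built-in DOM classes as a
// stand-in for ScraperWiki's simple_html_dom. The HTML is an illustrative
// fragment, not the real Oxford Garages page.
$html = '<html><body>'
      . '<div class="pagewidget">Site Location: Barton</div>'
      . '<div class="menu">Navigation</div>'
      . '<div class="pagewidget">Site Location: Blackbird Leys</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath equivalent of the CSS selector div.pagewidget
$listings = $xpath->query("//div[@class='pagewidget']");

$texts = array();
foreach ($listings as $div) {
    $texts[] = $div->textContent;
}

// Only the two listing divs should match, not the navigation div.
assert(count($texts) === 2);
?>
```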
3) Check what Scraper Wiki returns and start refining your scraper

If you click 'Run' below your scraper you should now see a range of elements returned in the console. The default PHP template loops through all the elements that match the selector we just set and prints them out to the console.

My scraper returns quite a few elements I don't want (there must be more than just the Garage listings picked out by the div.pagewidget selector), so I look for something uniform about the elements I do want. In this case they all start with 'Site Location' (or at least the plaintext versions of them, as returned by $data->plaintext, do).

I can now add some conditional code to my scraper to only carry on processing those elements that contain 'Site Location'. I've chosen to use the 'stristr' function in PHP, which just checks whether one string is contained in another and is case-insensitive, rather than checking the exact position of the phrase, so as to be tolerant in case there is variation in the way the data is presented that I've not spotted.
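The filtering step described above might look like this sketch. The $plaintexts array is illustrative sample data standing in for the $data->plaintext values a real run would produce.

```php
<?php
// Keep only scraped elements whose plain text contains 'Site Location',
// using stristr() so the match is case-insensitive. Sample data only.
$plaintexts = array(
    "Site Location: Barton; Postcode: OX3 9LP",
    "Navigation menu",
    "SITE LOCATION: Blackbird Leys; Postcode: OX4 6HN",
);

$listings = array();
foreach ($plaintexts as $text) {
    if (stristr($text, 'Site Location') !== false) {
        $listings[] = $text;  // only carry on processing real listings
    }
}

// Two of the three sample strings match, regardless of case.
assert(count($listings) === 2);
?>
```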
4) Loop, slice and dice

The next steps will depend on how your data is formatted. You may have lots more nested selectors to work through to pick out the elements you want. You can use $data just like the $dom object earlier. So, for example, we can use $data->find("img", 0)->src; to return the 'src' attribute of the first (0) image element (img) we find in each garage listing.

Sometimes you get down to text which isn't nicely formatted in HTML, and then you will need to use different string processing to pull apart the bits you want. For example, in the Garage listings we can separate each line of plain text by splitting the text at <br> elements, and then splitting each line at the colon ':' used to separate titles and values. A check of the raw source shows the Oxford Garages page uses both <BR> and <br /> as elements, so we can use a replace function to standardise these (or we could use regular expressions for splitting).

In the Oxford Garages case as well, our data is split across multiple pages, so once we have the scraper for a single page working right, we can nest it inside a scraper that grabs the list of pages and loops through those too. Scraper Wiki also includes useful helper code for working with forms, for sites where you have to submit searches or make selections in forms to view any data.
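The string-processing step above can be sketched with plain PHP functions. The $raw string is made-up sample data in the shape the text describes: lines separated by a mix of <BR> and <br />, with 'Title: value' pairs on each line.

```php
<?php
// Split mixed <BR>/<br /> line breaks, then split each line at the colon
// to recover title/value pairs. Sample data only.
$raw = "Site Location: Barton<BR>Postcode: OX3 9LP<br />Weekly Rent: 7.50";

// Standardise the two line-break variants (case-insensitively) before splitting.
$raw = str_ireplace(array('<br />', '<br>'), '<br>', $raw);

$record = array();
foreach (explode('<br>', $raw) as $line) {
    // Split each line at the first colon only, into a title and a value.
    $parts = explode(':', $line, 2);
    if (count($parts) === 2) {
        $record[trim($parts[0])] = trim($parts[1]);
    }
}

assert($record['Postcode'] === 'OX3 9LP');
?>
```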
5) Save each section of scraped data for use later

Towards the end of each loop through the elements you are scraping (each row in your final dataset) you will need to call the scraperwiki::save() function. This takes four parameters:

Firstly, an array indicating the name of the unique key in your data that should be used to work out whether a record is new, or an update to an existing record.
Second, an array of data values to save.
Third, the date of the record (for indexing). Leave as null to just use the date the scraper was run.
Fourth, an array of latitude and longitude, if you have geocoded your data.

Run your scraper and check the 'data' tab to see what is being saved.
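The call shape described above might look like this sketch. The stub scraperwiki class here is a stand-in for the real ScraperWiki PHP environment, so the example runs anywhere; the four arguments follow the order given in the text.

```php
<?php
// Stand-in for the real ScraperWiki environment: the real scraperwiki::save()
// writes to the ScraperWiki datastore, this stub just records its arguments.
class scraperwiki {
    public static $saved = array();
    public static function save($unique_keys, $values, $date = null, $latlng = null) {
        self::$saved[] = array($unique_keys, $values, $date, $latlng);
    }
}

// One row of scraped data, keyed on 'Site Location' as the unique key.
$values = array('Site Location' => 'Barton', 'Postcode' => 'OX3 9LP');
scraperwiki::save(array('Site Location'), $values, null, null);

assert(count(scraperwiki::$saved) === 1);
?>
```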
6) (Optional) Sprinkle in some geocoding as required

If you have a UK postcode in your data then you can use the scraperwiki::gb_postcode_to_latlng() function to turn it into a latitude and longitude, and then save these into your generated dataset. For example, we can use $lat_lng = scraperwiki::gb_postcode_to_latlng($values['Postcode']); and then, when we save our data, we add the $lat_lng values to the end of the save function.
Drafts for the Open Data Cook Book - see http://www.opendatacookbook.net