Find your next favorite book

Become a member today and read free for 30 days
Web Scraping with Python

Web Scraping with Python

Read preview

Web Scraping with Python

4.5/5 (3 ratings)
322 pages
2 hours
Oct 28, 2015


This book is aimed at developers who want to build reliable solutions to scrape data from websites. It is assumed that the reader has prior programming experience with Python. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.
Oct 28, 2015

About the author

Related to Web Scraping with Python

Related Books
Related Articles

Book Preview

Web Scraping with Python - Richard Lawson

Table of Contents

Web Scraping with Python


About the Author

About the Reviewers

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders


What this book covers

What you need for this book

Who this book is for


Reader feedback

Customer support




1. Introduction to Web Scraping

When is web scraping useful?

Is web scraping legal?

Background research

Checking robots.txt

Examining the Sitemap

Estimating the size of a website

Identifying the technology used by a website

Finding the owner of a website

Crawling your first website

Downloading a web page

Retrying downloads

Setting a user agent

Sitemap crawler

ID iteration crawler

Link crawler

Advanced features

Parsing robots.txt

Supporting proxies

Throttling downloads

Avoiding spider traps

Final version


2. Scraping the Data

Analyzing a web page

Three approaches to scrape a web page

Regular expressions

Beautiful Soup


CSS selectors

Comparing performance

Scraping results


Adding a scrape callback to the link crawler


3. Caching Downloads

Adding cache support to the link crawler

Disk cache


Testing the cache

Saving disk space

Expiring stale data


Database cache

What is NoSQL?

Installing MongoDB

Overview of MongoDB

MongoDB cache implementation


Testing the cache


4. Concurrent Downloading

One million web pages

Parsing the Alexa list

Sequential crawler

Threaded crawler

How threads and processes work


Cross-process crawler



5. Dynamic Content

An example dynamic web page

Reverse engineering a dynamic web page

Edge cases

Rendering a dynamic web page

PyQt or PySide

Executing JavaScript

Website interaction with WebKit

Waiting for results

The Render class



6. Interacting with Forms

The Login form

Loading cookies from the web browser

Extending the login script to update content

Automating forms with the Mechanize module


7. Solving CAPTCHA

Registering an account

Loading the CAPTCHA image

Optical Character Recognition

Further improvements

Solving complex CAPTCHAs

Using a CAPTCHA solving service

Getting started with 9kw


Integrating with registration


8. Scrapy


Starting a project

Defining a model

Creating a spider

Tuning settings

Testing the spider

Scraping with the shell command

Checking results

Interrupting and resuming a crawl

Visual scraping with Portia



Tuning a spider

Checking results

Automated scraping with Scrapely


9. Overview

Google search engine


The website






Web Scraping with Python

Web Scraping with Python

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2015

Production reference: 1231015

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78216-436-4



Richard Lawson


Martin Burch

Christopher Davis

William Sankey

Ayush Tiwari

Acquisition Editor

Rebecca Youé

Content Development Editor

Akashdeep Kundu

Technical Editors

Novina Kewalramani

Shruti Rawool

Copy Editor

Sonia Cheema

Project Coordinator

Milton Dsouza


Safis Editing


Mariammal Chettiar

Production Coordinator

Nilesh R. Mohite

Cover Work

Nilesh R. Mohite

About the Author

Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he built a business specializing at web scraping while traveling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational at Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones.

I would like to thank Professor Timothy Baldwin for introducing me to this exciting field and Tharavy Douc for hosting me in Paris while I wrote this book.

About the Reviewers

Martin Burch is a data journalist based in New York City, where he makes interactive graphics for The Wall Street Journal. He holds a master of arts in journalism from the City University of New York's Graduate School of Journalism, and has a baccalaureate from New Mexico State University, where he studied journalism and information systems.

I would like to thank my wife, Lisa, who encouraged me to assist with this book; my uncle, Michael, who has always patiently answered my programming questions; and my father, Richard, who inspired my love of journalism and writing.

William Sankey is a data professional and hobbyist developer who lives in College Park, Maryland. He graduated in 2012 from Johns Hopkins University with a master's degree in public policy and specializes in quantitative analysis. He is currently a health services researcher at L&M Policy Research, LLC, working on projects for the Centers for Medicare and Medicaid Services (CMS). The scope of these projects range from evaluating Accountable Care Organizations to monitoring the Inpatient Psychiatric Facility Prospective Payment System.

I would like to thank my devoted wife, Julia, and rambunctious puppy, Ruby, for all their love and support.

Ayush Tiwari is a Python developer and undergraduate at IIT Roorkee. He has been working at Information Management Group, IIT Roorkee, since 2013, and has been actively working in the web development field. Reviewing this book has been a great experience for him. He did his part not only as a reviewer, but also as an avid learner of web scraping. He recommends this book to all Python enthusiasts so that they can enjoy the benefits of scraping.

He is enthusiastic about Python web scraping and has worked on projects such as live sports feeds, as well as a generalized Python e-commerce web scraper (at Miranj).

He has also been handling a placement portal with the help of a Django app to assist the placement process at IIT Roorkee.

Besides backend development, he loves to work on computational Python/data analysis using Python libraries, such as NumPy, SciPy, and is currently working in the CFD research field. You can visit his projects on GitHub. His username is tiwariayush.

He loves trekking through Himalayan valleys and participates in several treks every year, adding this to his list of interests, besides playing the guitar. Among his accomplishments, he is a part of the internationally acclaimed Super 30 group and has also been a rank holder in it. When he was in high school, he also qualified for the International Mathematical Olympiad.

I have been provided a lot of help by my family members (my sister, Aditi, my parents, and Anand sir), my friends at VI and IMG, and my professors. I would like to thank all of them for the support they have given me.

Last but not least, kudos to the respected author and the Packt Publishing team for publishing these fantastic tech books. I commend all the hard work involved in producing their books.

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.


The Internet contains the most useful set of data ever assembled, which is largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be extracted to be useful. This process of extracting data from web pages is known as web scraping and is becoming increasingly useful as ever more information is available online.

What this book covers

Chapter 1, Introduction to Web Scraping, introduces web scraping and explains ways to crawl a website.

Chapter 2, Scraping the Data, shows you how to extract data from web pages.

Chapter 3, Caching Downloads, teaches you how to avoid redownloading by caching results.

Chapter 4, Concurrent Downloading, helps you to scrape data faster by downloading in parallel.

Chapter 5, Dynamic Content, shows you how to extract data from dynamic websites.

Chapter 6, Interacting with Forms, shows you how to work with forms to access the data you are after.

Chapter 7, Solving CAPTCHA, elaborates how to access data that is protected by CAPTCHA images.

Chapter 8, Scrapy, teaches you how to use the popular high-level Scrapy framework.

Chapter 9, Overview, is an overview of web scraping techniques that have been covered.

What you need for this book

All the code used in this book has been tested with Python 2.7, and is available for download at Ideally, in a future version of this book, the examples will be ported to Python 3. However, for now, many of the libraries required (such as Scrapy/Twisted, Mechanize, and Ghost) are only available for Python 2. To help illustrate the crawling examples, we created a sample website at This website limits how fast you can download content, so if you prefer to host this yourself the source code and installation instructions are available at

We decided to build a custom website for many of the examples used in this book instead of scraping live websites, so that we have full control over the environment. This provides us stability—live websites are updated more often than books, and by the time you try a scraping example, it may no longer work. Also, a custom website allows us to craft examples that illustrate specific skills and avoid distractions. Finally, a

You've reached the end of this preview. Sign up to read more!
Page 1 of 1


What people think about Web Scraping with Python

3 ratings / 0 Reviews
What did you think?
Rating: 0 out of 5 stars

Reader reviews