Find your next favorite book

Become a member today and read free for 30 days
Web Scraping with Python

Web Scraping with Python

Read preview

Web Scraping with Python

ratings:
4.5/5 (3 ratings)
Length:
322 pages
2 hours
Released:
Oct 28, 2015
ISBN:
9781782164371
Format:
Book

Description

This book is aimed at developers who want to build reliable solutions to scrape data from websites. It is assumed that the reader has prior programming experience with Python. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.
Released:
Oct 28, 2015
ISBN:
9781782164371
Format:
Book

About the author


Related to Web Scraping with Python

Related Books
Related Articles

Book Preview

Web Scraping with Python - Richard Lawson

Table of Contents

Web Scraping with Python

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Errata

Piracy

Questions

1. Introduction to Web Scraping

When is web scraping useful?

Is web scraping legal?

Background research

Checking robots.txt

Examining the Sitemap

Estimating the size of a website

Identifying the technology used by a website

Finding the owner of a website

Crawling your first website

Downloading a web page

Retrying downloads

Setting a user agent

Sitemap crawler

ID iteration crawler

Link crawler

Advanced features

Parsing robots.txt

Supporting proxies

Throttling downloads

Avoiding spider traps

Final version

Summary

2. Scraping the Data

Analyzing a web page

Three approaches to scrape a web page

Regular expressions

Beautiful Soup

Lxml

CSS selectors

Comparing performance

Scraping results

Overview

Adding a scrape callback to the link crawler

Summary

3. Caching Downloads

Adding cache support to the link crawler

Disk cache

Implementation

Testing the cache

Saving disk space

Expiring stale data

Drawbacks

Database cache

What is NoSQL?

Installing MongoDB

Overview of MongoDB

MongoDB cache implementation

Compression

Testing the cache

Summary

4. Concurrent Downloading

One million web pages

Parsing the Alexa list

Sequential crawler

Threaded crawler

How threads and processes work

Implementation

Cross-process crawler

Performance

Summary

5. Dynamic Content

An example dynamic web page

Reverse engineering a dynamic web page

Edge cases

Rendering a dynamic web page

PyQt or PySide

Executing JavaScript

Website interaction with WebKit

Waiting for results

The Render class

Selenium

Summary

6. Interacting with Forms

The Login form

Loading cookies from the web browser

Extending the login script to update content

Automating forms with the Mechanize module

Summary

7. Solving CAPTCHA

Registering an account

Loading the CAPTCHA image

Optical Character Recognition

Further improvements

Solving complex CAPTCHAs

Using a CAPTCHA solving service

Getting started with 9kw

9kw CAPTCHA API

Integrating with registration

Summary

8. Scrapy

Installation

Starting a project

Defining a model

Creating a spider

Tuning settings

Testing the spider

Scraping with the shell command

Checking results

Interrupting and resuming a crawl

Visual scraping with Portia

Installation

Annotation

Tuning a spider

Checking results

Automated scraping with Scrapely

Summary

9. Overview

Google search engine

Facebook

The website

The API

Gap

BMW

Summary

Index

Web Scraping with Python


Web Scraping with Python

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2015

Production reference: 1231015

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78216-436-4

www.packtpub.com

Credits

Author

Richard Lawson

Reviewers

Martin Burch

Christopher Davis

William Sankey

Ayush Tiwari

Acquisition Editor

Rebecca Youé

Content Development Editor

Akashdeep Kundu

Technical Editors

Novina Kewalramani

Shruti Rawool

Copy Editor

Sonia Cheema

Project Coordinator

Milton Dsouza

Proofreader

Safis Editing

Indexer

Mariammal Chettiar

Production Coordinator

Nilesh R. Mohite

Cover Work

Nilesh R. Mohite

About the Author

Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he built a business specializing at web scraping while traveling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational at Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones.

I would like to thank Professor Timothy Baldwin for introducing me to this exciting field and Tharavy Douc for hosting me in Paris while I wrote this book.

About the Reviewers

Martin Burch is a data journalist based in New York City, where he makes interactive graphics for The Wall Street Journal. He holds a master of arts in journalism from the City University of New York's Graduate School of Journalism, and has a baccalaureate from New Mexico State University, where he studied journalism and information systems.

I would like to thank my wife, Lisa, who encouraged me to assist with this book; my uncle, Michael, who has always patiently answered my programming questions; and my father, Richard, who inspired my love of journalism and writing.

William Sankey is a data professional and hobbyist developer who lives in College Park, Maryland. He graduated in 2012 from Johns Hopkins University with a master's degree in public policy and specializes in quantitative analysis. He is currently a health services researcher at L&M Policy Research, LLC, working on projects for the Centers for Medicare and Medicaid Services (CMS). The scope of these projects range from evaluating Accountable Care Organizations to monitoring the Inpatient Psychiatric Facility Prospective Payment System.

I would like to thank my devoted wife, Julia, and rambunctious puppy, Ruby, for all their love and support.

Ayush Tiwari is a Python developer and undergraduate at IIT Roorkee. He has been working at Information Management Group, IIT Roorkee, since 2013, and has been actively working in the web development field. Reviewing this book has been a great experience for him. He did his part not only as a reviewer, but also as an avid learner of web scraping. He recommends this book to all Python enthusiasts so that they can enjoy the benefits of scraping.

He is enthusiastic about Python web scraping and has worked on projects such as live sports feeds, as well as a generalized Python e-commerce web scraper (at Miranj).

He has also been handling a placement portal with the help of a Django app to assist the placement process at IIT Roorkee.

Besides backend development, he loves to work on computational Python/data analysis using Python libraries, such as NumPy, SciPy, and is currently working in the CFD research field. You can visit his projects on GitHub. His username is tiwariayush.

He loves trekking through Himalayan valleys and participates in several treks every year, adding this to his list of interests, besides playing the guitar. Among his accomplishments, he is a part of the internationally acclaimed Super 30 group and has also been a rank holder in it. When he was in high school, he also qualified for the International Mathematical Olympiad.

I have been provided a lot of help by my family members (my sister, Aditi, my parents, and Anand sir), my friends at VI and IMG, and my professors. I would like to thank all of them for the support they have given me.

Last but not least, kudos to the respected author and the Packt Publishing team for publishing these fantastic tech books. I commend all the hard work involved in producing their books.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

The Internet contains the most useful set of data ever assembled, which is largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be extracted to be useful. This process of extracting data from web pages is known as web scraping and is becoming increasingly useful as ever more information is available online.

What this book covers

Chapter 1, Introduction to Web Scraping, introduces web scraping and explains ways to crawl a website.

Chapter 2, Scraping the Data, shows you how to extract data from web pages.

Chapter 3, Caching Downloads, teaches you how to avoid redownloading by caching results.

Chapter 4, Concurrent Downloading, helps you to scrape data faster by downloading in parallel.

Chapter 5, Dynamic Content, shows you how to extract data from dynamic websites.

Chapter 6, Interacting with Forms, shows you how to work with forms to access the data you are after.

Chapter 7, Solving CAPTCHA, elaborates how to access data that is protected by CAPTCHA images.

Chapter 8, Scrapy, teaches you how to use the popular high-level Scrapy framework.

Chapter 9, Overview, is an overview of web scraping techniques that have been covered.

What you need for this book

All the code used in this book has been tested with Python 2.7, and is available for download at http://bitbucket.org/wswp/code. Ideally, in a future version of this book, the examples will be ported to Python 3. However, for now, many of the libraries required (such as Scrapy/Twisted, Mechanize, and Ghost) are only available for Python 2. To help illustrate the crawling examples, we created a sample website at http://example.webscraping.com. This website limits how fast you can download content, so if you prefer to host this yourself the source code and installation instructions are available at http://bitbucket.org/wswp/places.

We decided to build a custom website for many of the examples used in this book instead of scraping live websites, so that we have full control over the environment. This provides us stability—live websites are updated more often than books, and by the time you try a scraping example, it may no longer work. Also, a custom website allows us to craft examples that illustrate specific skills and avoid distractions. Finally, a

You've reached the end of this preview. Sign up to read more!
Page 1 of 1

Reviews

What people think about Web Scraping with Python

4.7
3 ratings / 0 Reviews
What did you think?
Rating: 0 out of 5 stars

Reader reviews