
Basic web scraping without programming
Mark Walker of The New York Times
What is Web Scraping?
(You’ve been doing it all along)
Web scraping is...
-- Web scraping, also known as web data extraction, is the
process of retrieving, or “scraping,” data from a website.

-- If you’ve ever copied and pasted information from a website,
you’ve performed the same function as a web scraper.
Understanding HTML
Hypertext Markup Language (HTML)
Hypertext Markup Language (HTML) is the standard
markup language for documents designed to be
displayed in a web browser.
Import HTML
Import HTML tables and lists into Google Sheets using the function
=IMPORTHTML("URL", "QUERY", INDEX)

The Formula
Breaking down the Formula
URL
URL - The URL of the page to examine, including protocol (e.g.
http://).

The value for URL must either be enclosed in quotation marks or be a
reference to a cell containing the appropriate text.
QUERY
Query - Either "list" or "table" depending on what type of
structure contains the desired data.
INDEX
Index - The index, starting at 1, which identifies which table or list as
defined in the HTML source should be returned.

The indices for lists and tables are maintained separately, so there may be
both a list and a table with index 1 if both types of elements exist on the
HTML page.
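
For readers curious about what happens behind the formula, here is a minimal Python sketch of the "table" case. It is an illustration only, not how Google Sheets implements IMPORTHTML. The names TableCollector, import_html_table, and SAMPLE_PAGE are hypothetical, the sample page is made up, and the sketch assumes INDEX counts from 1, as Google's documentation for IMPORTHTML describes. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, in document order
        self._rows = None   # rows of the table being read, or None
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None and self._rows is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def import_html_table(html, index):
    """Return the table at a 1-based index (assumed INDEX convention)."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables[index - 1]


# Made-up page: one list and two tables. The list does not affect
# table numbering, because lists and tables are counted separately.
SAMPLE_PAGE = """
<ul><li>a list, counted separately from tables</li></ul>
<table><tr><th>City</th><th>Population</th></tr>
<tr><td>Springfield</td><td>30000</td></tr></table>
<table><tr><td>second table</td></tr></table>
"""

print(import_html_table(SAMPLE_PAGE, 1))
```

The list at the top of the sample page is skipped when counting tables, which mirrors the separate numbering of lists and tables described above.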
Other Formulas
IMPORTXML: Imports data from any of various structured data types including XML,
HTML, CSV, TSV, and RSS and ATOM XML feeds.

IMPORTRANGE: Imports a range of cells from a specified spreadsheet.

IMPORTFEED: Imports an RSS or ATOM feed.

IMPORTDATA: Imports data at a given URL in .csv (comma-separated value) or
.tsv (tab-separated value) format.
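
As a rough illustration of the two formats IMPORTDATA reads, the following Python sketch splits comma-separated and tab-separated text into rows of cells. It works on in-memory sample text rather than a URL, and the names parse_delimited, SAMPLE_CSV, and SAMPLE_TSV are hypothetical. It is not a description of how Google Sheets processes the file.

```python
import csv
import io


def parse_delimited(text, delimiter=","):
    """Split delimited text into a list of rows, each a list of cell strings."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)


# Made-up sample data in each format.
SAMPLE_CSV = "name,score\nAda,90\nGrace,95\n"
SAMPLE_TSV = "name\tscore\nAda\t90\n"

csv_rows = parse_delimited(SAMPLE_CSV)                   # comma-separated
tsv_rows = parse_delimited(SAMPLE_TSV, delimiter="\t")   # tab-separated

print(csv_rows)
print(tsv_rows)
```

The only difference between the two formats in this sketch is the delimiter character passed to the reader.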
