Theory of Data Scraping
Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.

1 Description

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all.[1]

Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped was intended for display to an end user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.

Data scraping is most often done either to interface to a legacy system which has no other mechanism compatible with current hardware, or to interface to a third-party system which does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.

Data scraping is generally considered an ad hoc, inelegant technique, often used only as a last resort when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but computer programs will often crash or produce incorrect results.

2 Screen scraping

Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This could be the simple case where the controlling program navigates through the user interface, or more complex scenarios where the controlling program is entering data into an interface meant to be used by a human.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized data processing. Computer-to-user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still in use today, for various reasons). The desire to interface such a system to more modern systems is common. A robust solution will often require things no longer available, such as source code, system documentation, APIs, or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper which pretends to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. (A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise, e.g. change control, security, user management, data protection, operational audit, load balancing and queue management, etc., could be said to be an example of robotic automation software.)

In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24×80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder. Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on VAX/VMS called the Logicizer.[2]

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through
an OCR engine, or in the case of GUI applications, querying the graphical controls by programmatically obtaining
references to their underlying programming objects.
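The legacy-terminal workflow described above (connect, emulate keystrokes, capture the display, extract the data) can be sketched as follows. The `LegacyHost` class is a hypothetical stub standing in for a real Telnet or serial connection, and the menu screens it serves are invented for illustration; a real scraper would drive an actual transport library instead.

```python
# Sketch of a terminal screen scraper driving a menu-based legacy
# text interface. LegacyHost is a hypothetical stub transport; its
# screens imitate the kind of human-oriented output being scraped.

class LegacyHost:
    """Stub that mimics a menu-driven terminal application."""
    def __init__(self):
        self._screens = {
            "": "MAIN MENU\n1) CUSTOMER LOOKUP\n2) EXIT\n> ",
            "1\n": "ENTER CUSTOMER ID: ",
            "1\n42\n": "CUSTOMER 42: ADA LOVELACE   BALANCE: 125.00\n",
        }
        self._typed = ""

    def send(self, keys):
        self._typed += keys          # emulate keystrokes sent to the host

    def read_screen(self):
        return self._screens[self._typed]   # current display output

def scrape_balance(host, customer_id):
    """Navigate the menus, then extract the balance from the display."""
    host.send("1\n")                  # choose CUSTOMER LOOKUP from the menu
    host.send(f"{customer_id}\n")     # type the customer id
    screen = host.read_screen()       # capture the resulting screen
    # The output is meant for humans, so we parse past the label text.
    return float(screen.split("BALANCE:")[1].strip())

print(scrape_balance(LegacyHost(), 42))   # 125.0
```

The extraction step illustrates the core difficulty: the screen carries labels and layout for a human reader, and the scraper must locate the one field it needs within that presentation.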
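The 1980s "page shredding" practice can likewise be sketched as a fixed-position parse of a character grid. The page layout, instruments, and column offsets below are hypothetical, assuming a 24×80-style screen where each field occupies a known column range; real provider pages differed per service.

```python
# Sketch of "page shredding": converting a fixed-layout character
# screen into numeric data. The layout is hypothetical: columns 0-9
# hold the instrument, 10-21 a label, 22-31 a "bid/offer" quote.

SCREEN = [
    f"{'GBP/USD':<10}{'SPOT':<12}{'1.5840/50':<10}",
    f"{'USD/JPY':<10}{'SPOT':<12}{'134.25/35':<10}",
]

def shred(lines):
    """Extract (instrument, bid) pairs from fixed column positions."""
    quotes = {}
    for line in lines:
        instrument = line[0:10].strip()        # columns 0-9: instrument name
        quote = line[22:32].strip()            # columns 22-31: bid/offer text
        bid = float(quote.split("/")[0])       # keep the bid side of "bid/offer"
        quotes[instrument] = bid
    return quotes

print(shred(SCREEN))   # {'GBP/USD': 1.584, 'USD/JPY': 134.25}
```

Because the column positions are hard-coded against a display meant for people, any cosmetic change to the page silently breaks the parse, which is exactly the fragility the Description section attributes to data scraping in general.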
5 See also

Data munging
Importer (computing)
Information extraction
Report mining
Web scraping
6 References
[1] "Custom web crawlers and data scraping". Bot Gurus.
[2] "Contributors fret about Reuters' plan to switch from Monitor network to IDN". FX Week, 2 November 1990. http://www.fxweek.com/fx-week/news/1539599/contributors-fret-about-reuters-plan-to-switch-from-monitor-network-to-idn
[3] "Dibot aims to make it easier for apps to read Web pages the way humans do". MIT Technology Review. Retrieved 1 December 2014.
[4] Scott Steinacher, "DataPump transforms host data", InfoWorld, 30 August 1999, p. 55.
7 Further reading
Hemenway, Kevin and Calishain, Tara. Spidering Hacks. Cambridge, Massachusetts: O'Reilly, 2003. ISBN 0-596-00577-6.