Theory of Data Scraping
Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.

1 Description

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all.[1]

Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped was intended for display to an end user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.

Data scraping is most often done either to interface to a legacy system which has no other mechanism compatible with current hardware, or to interface to a third-party system which does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.

Data scraping is generally considered an ad hoc, inelegant technique, often used only as a last resort when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but computer programs will often crash or produce incorrect results.

2 Screen scraping

Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the bidirectional exchange of data. This could be the simple case where the controlling program navigates through the user interface, or more complex scenarios where the controlling program is entering data into an interface meant to be used by a human.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s, the dawn of computerized data processing. Computer-to-user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still in use today, for various reasons). The desire to interface such a system to more modern systems is common. A robust solution will often require things no longer available, such as source code, system documentation, APIs, or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper which pretends to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. (A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise, e.g. change control, security, user management, data protection, operational audit, load balancing and queue management, etc., could be said to be an example of robotic automation software.)

In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24×80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder. Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on VAX/VMS called the Logicizer.[2]

More modern screen scraping techniques include capturing the bitmap data from the screen and running it through
an OCR engine, or in the case of GUI applications, querying the graphical controls by programmatically obtaining
references to their underlying programming objects.
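The legacy-terminal workflow described above (connect, emulate keystrokes, capture the display, extract the data) can be sketched as follows. The `LegacyHost` class is a hypothetical stub standing in for a real Telnet or serial connection, and the menu screens it serves are invented for illustration; a real scraper would drive an actual transport library instead.

```python
# Sketch of a terminal screen scraper driving a menu-based legacy
# text interface. LegacyHost is a hypothetical stub transport; its
# screens imitate the kind of human-oriented output being scraped.

class LegacyHost:
    """Stub that mimics a menu-driven terminal application."""
    def __init__(self):
        self._screens = {
            "": "MAIN MENU\n1) CUSTOMER LOOKUP\n2) EXIT\n> ",
            "1\n": "ENTER CUSTOMER ID: ",
            "1\n42\n": "CUSTOMER 42: ADA LOVELACE   BALANCE: 125.00\n",
        }
        self._typed = ""

    def send(self, keys):
        self._typed += keys          # emulate keystrokes sent to the host

    def read_screen(self):
        return self._screens[self._typed]   # current display output

def scrape_balance(host, customer_id):
    """Navigate the menus, then extract the balance from the display."""
    host.send("1\n")                  # choose CUSTOMER LOOKUP from the menu
    host.send(f"{customer_id}\n")     # type the customer id
    screen = host.read_screen()       # capture the resulting screen
    # The output is meant for humans, so we parse past the label text.
    return float(screen.split("BALANCE:")[1].strip())

print(scrape_balance(LegacyHost(), 42))   # 125.0
```

The extraction step illustrates the core difficulty: the screen carries labels and layout for a human reader, and the scraper must locate the one field it needs within that presentation.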
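The 1980s "page shredding" practice can likewise be sketched as a fixed-position parse of a character grid. The page layout, instruments, and column offsets below are hypothetical, assuming a 24×80-style screen where each field occupies a known column range; real provider pages differed per service.

```python
# Sketch of "page shredding": converting a fixed-layout character
# screen into numeric data. The layout is hypothetical: columns 0-9
# hold the instrument, 10-21 a label, 22-31 a "bid/offer" quote.

SCREEN = [
    f"{'GBP/USD':<10}{'SPOT':<12}{'1.5840/50':<10}",
    f"{'USD/JPY':<10}{'SPOT':<12}{'134.25/35':<10}",
]

def shred(lines):
    """Extract (instrument, bid) pairs from fixed column positions."""
    quotes = {}
    for line in lines:
        instrument = line[0:10].strip()        # columns 0-9: instrument name
        quote = line[22:32].strip()            # columns 22-31: bid/offer text
        bid = float(quote.split("/")[0])       # keep the bid side of "bid/offer"
        quotes[instrument] = bid
    return quotes

print(shred(SCREEN))   # {'GBP/USD': 1.584, 'USD/JPY': 134.25}
```

Because the column positions are hard-coded against a display meant for people, any cosmetic change to the page silently breaks the parse, which is exactly the fragility the Description section attributes to data scraping in general.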
5 See also

Data munging
Importer (computing)
Information extraction
Report mining
Web scraping
6 References
[1] "Custom web crawlers and data scraping". Bot Gurus.
[2] "Contributors fret about Reuters' plan to switch from Monitor network to IDN". FX Week, 2 November 1990. http://www.fxweek.com/fx-week/news/1539599/contributors-fret-about-reuters-plan-to-switch-from-monitor-network-to-idn
[3] "Dibot aims to make it easier for apps to read Web pages the way humans do". MIT Technology Review. Retrieved 1 December 2014.
[4] Scott Steinacher, "DataPump transforms host data", InfoWorld, 30 August 1999, p. 55.
7 Further reading
Hemenway, Kevin and Calishain, Tara. Spidering Hacks. Cambridge, Massachusetts: O'Reilly, 2003. ISBN 0-596-00577-6.