Web Scraping
Bo Zhao, University of Washington, Seattle
All content following this page was uploaded by Bo Zhao on 30 December 2018.
scraping HTML and other XML documents. It provides convenient Pythonic functions for navigating, searching, and modifying a parse tree, as well as a toolkit for decomposing an HTML file and extracting the desired information via lxml or html5lib. Beautiful Soup can automatically detect the encoding of the document being parsed and convert it to a client-readable encoding. Similarly, Pyquery provides a set of jQuery-like functions for parsing XML documents, but unlike Beautiful Soup, Pyquery supports only lxml, for fast XML processing.
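The parse-tree navigation that Beautiful Soup provides can be sketched as follows; the HTML fragment, tag names, and class names below are hypothetical, made up purely for illustration:

```python
# Minimal Beautiful Soup sketch: parse an HTML fragment and search
# the resulting parse tree. The snippet and class names are invented.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Quotes</h1>
  <ul>
    <li class="quote">AAPL: 172.50</li>
    <li class="quote">MSFT: 311.20</li>
  </ul>
</body></html>
"""

# html.parser is the standard-library backend; lxml or html5lib can be
# substituted if they are installed.
soup = BeautifulSoup(html, "html.parser")

# Search the tree for every <li> tag carrying the class "quote".
quotes = [li.get_text() for li in soup.find_all("li", class_="quote")]
print(quotes)  # ['AAPL: 172.50', 'MSFT: 311.20']
```

The same tree can also be modified in place (e.g., renaming tags or editing strings) before being re-serialized, which is what distinguishes a parse-tree library from simple pattern matching.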
Of the various types of web scraping programs, some are created to automatically recognize the data structure of a page, such as Nutch or Scrapy, while others provide a web-based graphic interface that eliminates the need for manually written web scraping code, such as Import.io. Nutch is a robust and scalable web crawler written in Java. It enables fine-grained configuration, parallel harvesting, robots.txt rule support, and machine learning. Scrapy, written in Python, is a reusable web crawling framework. It speeds up the process of building and scaling large crawling projects, and it also provides an interactive shell for simulating the browsing behavior of a human user. To enable nonprogrammers to harvest web content, web-based crawlers with a graphic interface are purposely designed to mitigate the complexity of using a web scraping program. Among them, Import.io is a typical crawler for extracting data from websites without writing any code: it allows users to identify and convert unstructured web pages into a structured format, and its graphic interface for data identification allows a user to train the tool on what to extract. The extracted data are then stored on a dedicated cloud server and can be exported in CSV, JSON, or XML format. A web-based crawler with a graphic interface can easily harvest and visualize real-time data streams with an SVG or WebGL engine, but falls short in manipulating large data sets.
Web scraping can be used for a wide variety of scenarios, such as contact scraping, price-change monitoring and comparison, product review collection, gathering of real estate listings, weather data monitoring, website change detection, and web data integration. For example, at a micro-scale, the price of a stock can be regularly scraped in order to visualize its change over time (Case et al. 2005), and social media feeds can be collectively scraped to investigate public opinions and identify opinion leaders (Liu and Zhao 2016). At a macro-level, the metadata of nearly every website is constantly scraped to build up Internet search engines, such as Google Search or Bing Search (Snyder 2003).

Although web scraping is a powerful technique for collecting large data sets, it is controversial and may raise legal questions related to copyright (O'Reilly 2006), terms of service (ToS) (Fisher et al. 2010), and "trespass to chattels" (Hirschey 2014). A web scraper is free to copy a piece of data in figure or table form from a web page without any copyright infringement, because it is difficult to prove a copyright over such data: only a specific arrangement or a particular selection of the data is legally protected. Regarding the ToS, although most web applications include some form of ToS agreement, their enforceability usually lies within a gray area. For instance, the owner of a web scraper that violates the ToS may argue that he or she never saw or officially agreed to the ToS. Moreover, if a web scraper sends data-acquiring requests too frequently, this is functionally equivalent to a denial-of-service attack, in which case the web scraper owner may be refused entry and may be liable for damages under the law of "trespass to chattels," because the owner of the web application has a property interest in the physical web server that hosts the application. An ethical web scraping tool will avoid this issue by maintaining a reasonable requesting frequency.

A web application may adopt one of the following measures to stop or interfere with a web scraping tool that collects data from the given website; such measures try to identify whether an operation was conducted by a human being or a bot. Some of the major measures include the following: HTML "fingerprinting," which inspects the HTML headers to identify whether a visitor is malicious or safe (Acar et al. 2013); IP reputation determination, where IP addresses with a recorded history of use in website assaults are treated with suspicion and are more likely to be heavily scrutinized (Sadan and Schwartz 2012); behavior analysis, which reveals abnormal behavioral patterns, such as placing a suspiciously high rate of requests or adhering to anomalous browsing patterns; and progressive challenges that filter out bots with a set of tasks, such as cookie support, JavaScript execution, and CAPTCHA (Doran and Gokhale 2011).
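The "reasonable requesting frequency" that an ethical scraper maintains can be sketched with Python's standard library alone. The robots.txt rules, user agent, paths, and delay below are hypothetical illustrations, not any real site's policy:

```python
# Sketch of "polite" scraping: respect robots.txt and throttle requests
# so the load never resembles a denial-of-service attack.
import time
import urllib.robotparser

# Parse a robots.txt policy. Here it is supplied inline for illustration;
# set_url() + read() would fetch a live file from a real site instead.
rules = urllib.robotparser.RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def can_scrape(url, user_agent="ExampleBot"):
    """Check the parsed robots.txt rules before requesting a URL."""
    return rules.can_fetch(user_agent, url)

def throttled(urls, delay_seconds=1.0):
    """Yield only the permitted URLs, pausing between requests."""
    for url in urls:
        if can_scrape(url):
            yield url
        time.sleep(delay_seconds)  # keep a reasonable requesting frequency

print(can_scrape("https://example.com/public/page"))   # True
print(can_scrape("https://example.com/private/data"))  # False
```

A fixed delay is the simplest throttle; production crawlers typically also back off when the server responds slowly or returns errors.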
Further Readings

Acar, G., Juarez, M., Nikiforakis, N., Diaz, C., Gürses, S., Piessens, F., & Preneel, B. (2013). FPDetective: Dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on computer & communications security. New York: ACM.
Bar-Ilan, J. (2001). Data collection methods on the web for infometric purposes – A review and analysis. Scientometrics, 50(1), 7–32.
Butler, J. (2007). Visual web page analytics. Google Patents.
Case, K. E., Quigley, J. M., & Shiller, R. J. (2005). Comparing wealth effects: The stock market versus the housing market. The BE Journal of Macroeconomics, 5(1), 1.
Doran, D., & Gokhale, S. S. (2011). Web robot detection techniques: Overview and limitations. Data Mining and Knowledge Discovery, 22(1), 183–210.
Fisher, D., McDonald, D. W., Brooks, A. L., & Churchill, E. F. (2010). Terms of service, ethics, and bias: Tapping the social web for CSCW research. Computer Supported Cooperative Work (CSCW), Panel discussion.
Hirschey, J. K. (2014). Symbiotic relationships: Pragmatic acceptance of data scraping. Berkeley Technology Law Journal, 29, 897.
Liu, J. C.-E., & Zhao, B. (2016). Who speaks for climate change in China? Evidence from Weibo. Climatic Change, 140(3), 413–422.
Mooney, S. J., Westreich, D. J., & El-Sayed, A. M. (2015). Epidemiology in the era of big data. Epidemiology, 26(3), 390.
O'Reilly, S. (2006). Nominative fair use and Internet aggregators: Copyright and trademark challenges posed by bots, web crawlers and screen-scraping technologies. Loyola Consumer Law Review, 19, 273.
Sadan, Z., & Schwartz, D. G. (2012). Social network analysis for cluster-based IP spam reputation. Information Management & Computer Security, 20(4), 281–295.
Snyder, R. (2003). Web search engine with graphic snapshots. Google Patents.
Yi, J., Nasukawa, T., Bunescu, R., & Niblack, W. (2003). Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003). Melbourne, FL: IEEE.