
Statistics of the Common Crawl Corpus 2012

Sebastian Spiegler, Data Scientist at SwiftKey

June 2013

The Common Crawl[1] is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analysed by everyone. The foundation crawled the web in 2008, 2009, 2010 and 2012. The focus of this article is an exploratory analysis of the latest crawl.

[Figure 1: Data flow of experiment. 857 thousand ARC files (65 terabytes compressed) hold 210 terabytes of data; the content comprises 3.83 billion documents from 41.4 million domains.]


Introduction
The Common Crawl (CC) corpus allows individuals or businesses to cost-effectively access terabytes of web crawl data using Amazon web services like Elastic MapReduce[2]. At SwiftKey[3], an innovative London-based start-up, we build world-class language technology, such as our award-winning Android soft keyboard. Amongst other features, the keyboard delivers multilingual error correction, word completion, next-word prediction and space inference for up to three languages concurrently. At present, we support more than 60 languages, with new languages constantly being added to the list. The CC corpus represents an excellent addition to our internal data sources for building and testing language models and for our research.

To better understand the content and the structure of the 2012 corpus, we carried out the exploratory analysis at hand. The remainder of the article is organised as follows. We will start with a short overview of the experimental setup and subsequently examine our results.
Experiment

The 2012 corpus is made up of 857 thousand ARC files which are stored in s3://aws-publicdatasets/common-crawl/parse-output/. Each ARC file is compressed and contains multiple entries of crawled documents. An entry consists of a header and the actual web document.[4][5]
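To make the entry layout concrete, here is a minimal sketch that iterates over the records of one downloaded ARC file. It assumes the ARC v1 convention in which each record begins with a one-line, space-separated header whose last field is the payload length in bytes; the file name and the exact header fields are illustrative assumptions, not details taken from the article.

    import gzip

    def iter_arc_records(path):
        """Yield (header_fields, payload) for each record of an ARC file.

        Assumes ARC v1 records: a one-line, space-separated header of the
        form '<url> <ip> <archive-date> <content-type> <length>' followed
        by <length> bytes of payload. The first record, the ARC file
        descriptor, is handled like any other record.
        """
        with gzip.open(path, "rb") as f:  # gzip reads concatenated members
            while True:
                line = f.readline()
                if not line:          # end of file
                    break
                line = line.strip()
                if not line:          # blank separator between records
                    continue
                fields = line.split(b" ")
                length = int(fields[-1])    # last header field: byte size
                payload = f.read(length)    # the stored web document
                yield fields, payload

    # Example: print URL, content type and size of every crawled document.
    for fields, payload in iter_arc_records("sample.arc.gz"):
        print(fields[0].decode("utf-8", "replace"),
              fields[-2].decode("ascii", "replace"), len(payload))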

Extracted data

For the purpose of this analysis, we have extracted the following fields for each web document in the 2012 corpus:

• The public suffix. The public suffix is the level under which a user can register a private domain. An up-to-date list is maintained by the Mozilla Foundation.[6] The public suffix of ‘bbc.co.uk’ would be ‘.co.uk’ whereas ‘.uk’ is the top-level domain (TLD). Although we will be using TLDs rather than public suffixes during our investigation, we thought the additional information might still be helpful for later analyses (see the sketch after this list).

• The second-level domain (SLD). This domain is directly below the public suffix and can be registered by an individual or organization. In our previous example, ‘bbc.co.uk’, the SLD would be ‘bbc’.

• The internet media type or content type. It is used to specify the content in various internet protocols and consists of a type and a sub-type component. Examples are text/html, text/xml, application/pdf or image/jpeg.[7]

• The character encoding. The encoding describes how one or more bytes are mapped to characters of a character set, a collection of letters and symbols. If the encoding is unknown or an incorrect encoding is applied, a byte sequence cannot be restored to its original text. Examples of character sets are ‘ASCII’ for English text, ‘ISO-8859-6’ for Arabic or ‘UTF-8’ for most world languages.

• The ARC file name. It is the last component of the uniform resource identifier s3://aws-publicdatasets/common-crawl/parse-output/segment/[segment]/[ARC_file] and allows a given web document to be linked to the ARC file it is stored in. ARC file names are unique, so the segment name is not necessary for identification.

• The byte size. The byte size is the number of raw bytes of the document’s content. We will be summing this value for multiple documents of the same SLD and for the entire corpus to make assumptions about the data distribution.
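To illustrate how the first two fields relate, here is a small sketch that splits a host name into its TLD, public suffix and SLD by matching the longest known suffix. The tiny suffix set and the function name are hypothetical stand-ins; a real pipeline would load the full Mozilla list from http://publicsuffix.org/.

    # Tiny illustrative subset; a real pipeline would load the full
    # Mozilla public suffix list from http://publicsuffix.org/.
    PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk", "org.uk"}

    def split_domain(host):
        """Return (tld, public_suffix, sld) for a host name."""
        labels = host.lower().split(".")
        tld = labels[-1]
        suffix = tld
        # Check progressively longer suffixes so the longest match wins.
        for i in range(len(labels) - 2, -1, -1):
            candidate = ".".join(labels[i:])
            if candidate in PUBLIC_SUFFIXES:
                suffix = candidate
        n = len(suffix.split("."))
        sld = labels[-n - 1] if len(labels) > n else None
        return tld, suffix, sld

    print(split_domain("www.bbc.co.uk"))  # ('uk', 'co.uk', 'bbc')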
Setup

Figure 1 shows the overall data flow of the experiment. The 2012 corpus[8] consists of 210 terabytes of web data which was processed to extract the fields listed above. This resulted in a 241 gigabyte summary of 3.83 billion documents corresponding to 41.4 million distinct second-level domains. The non-aggregated data is accessible at s3://aws-publicdatasets/common-crawl/index2012/.
For the experiment we made two major decisions. Instead of processing all ARC files at once, we split the 2012 corpus into manageable subsets of 25 thousand files, processed them individually and later combined the intermediate results. Furthermore, we chose the format of tab-separated values for the non-aggregated data (the long list of entries with public suffix, second-level domain, content type, encoding, file name and byte size), which would allow us to easily run SQL-like queries using Apache Hive[9] later on.
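As an illustration of the kind of SQL-like query this format enables, the sketch below sums the byte size per second-level domain over the tab-separated summary. The column order and the file name are assumptions made for the example; in Hive, the same aggregation would run as a GROUP BY over an external table defined on the TSV files.

    import csv
    from collections import defaultdict

    def bytes_per_domain(tsv_path):
        """Sum document byte sizes per registered domain.

        Assumed column order (one line per document): public suffix,
        second-level domain, content type, encoding, ARC file name,
        byte size.
        """
        totals = defaultdict(int)
        with open(tsv_path, newline="") as f:
            for suffix, sld, ctype, enc, arc, size in csv.reader(f, delimiter="\t"):
                # The article records suffixes with a leading dot ('.co.uk'),
                # so concatenation yields e.g. 'bbc.co.uk'.
                totals[sld + suffix] += int(size)
        return totals

    # Example: the ten largest domains by stored content.
    top = sorted(bytes_per_domain("summary.tsv").items(),
                 key=lambda kv: kv[1], reverse=True)[:10]
    for domain, total in top:
        print(domain, total)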
The actual experiment took approximately 1500 instance hours, totalling about US$200 including the use of EC2 spot instances and data transfer from and to S3. Along with development and testing, we spent about three times this figure. This makes the Common Crawl corpus very accessible, especially to start-ups like SwiftKey. A summary of the experimental setup is shown in Figure 2.

[Figure 2: Experiment setup. Extraction: 35 EMR Hadoop clusters of six m1.xlarge spot instances each (one master, five core nodes) ran for 6 h per cluster, giving 1260 instance hours over subsets of 25 thousand ARC files. Aggregation: an EMR Hive cluster of six m1.xlarge spot instances ran for 15 h, giving 90 instance hours over the 241 gigabytes of extracted information.]
Exploratory analysis

After extracting the public suffix, second-level domain, content type, encoding, ARC file name and byte size of 3.83 billion web documents, we wanted to answer general questions concerning the distribution of domains, media types and encodings, but also
Footnotes

[1] The official website of the Common Crawl Foundation is http://commoncrawl.org.
[2] See aws.amazon.com/elasticmapreduce/ for more details.
[3] Our official website is http://www.swiftkey.net/.
[4] More details on the data format can be found under http://commoncrawl.org/data/accessing-the-data/.
[5] The crawler truncated the content of fetched web documents at 2 megabytes.
[6] See http://publicsuffix.org/.
[7] See the Internet Assigned Numbers Authority for more details: http://www.iana.org/assignments/media-types.
[8] The 2012 corpus can be found here: s3://aws-publicdatasets/common-crawl/parse-output/.
[9] See http://hive.apache.org/ for more information.

