
Statistics of the Common Crawl Corpus 2012

Sebastian Spiegler, Data Scientist at SwiftKey

June 2013

The Common Crawl1 is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analysed by everyone. The foundation crawled the web in 2008, 2009, 2010 and 2012. The focus of this article is an exploratory analysis of the latest crawl.

Figure 1: Data flow of experiment (857 thousand ARC files; 210 terabytes of data, 65 terabytes compressed; 3.83 billion documents of content; 41.4 million domains).


Introduction
The Common Crawl (CC) corpus allows individuals or businesses to cost-effectively access terabytes of web crawl data using Amazon web services such as Elastic MapReduce2.

At SwiftKey3, an innovative London-based start-up, we build world-class language technology, such as our award-winning Android soft keyboard. Amongst other features, the keyboard delivers multilingual error correction, word completion, next-word prediction and space inference for up to three languages concurrently. At present, we support more than 60 languages, with new languages constantly being added to the list. The CC corpus represents an excellent addition to our internal data sources for building and testing language models and for our research.

To better understand the content and the structure of the 2012 corpus, we carried out the exploratory analysis at hand. The remainder of the article is organised as follows: we start with a short overview of the experimental setup and subsequently examine our results.

Experiment

The 2012 corpus is made up of 857 thousand ARC files which are stored in s3://aws-publicdatasets/common-crawl/parse-output/. Each ARC file is compressed and contains multiple entries of crawled documents. An entry consists of a header and the actual web document.4,5

Extracted data

For the purpose of this analysis, we extracted the following fields for each web document in the 2012 corpus:

• The public suffix. The public suffix is the level under which a user can register a private domain. An up-to-date list is maintained by the Mozilla Foundation.6 The public suffix of ‘bbc.co.uk’ would be ‘.co.uk’, whereas ‘.uk’ is the top-level domain (TLD). Although we will be using TLDs rather than public suffixes during our investigation, we thought the additional information might still be helpful for later analyses.

1 The official website of the Common Crawl Foundation is http://commoncrawl.org.
2 See aws.amazon.com/elasticmapreduce/ for more details.
3 Our official website is http://www.swiftkey.net/.
4 More details on the data format can be found at http://commoncrawl.org/data/accessing-the-data/.
5 The crawler truncated the content of fetched web documents at 2 megabytes.
6 See http://publicsuffix.org/.
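The decomposition of a host name into public suffix, second-level domain and TLD can be sketched as a longest-suffix match against the suffix list. The miniature `PUBLIC_SUFFIXES` set below is a stand-in for the real Mozilla list, which contains thousands of entries.

```python
# Sketch of public-suffix extraction via longest-suffix match.
# PUBLIC_SUFFIXES is a tiny stand-in for the Mozilla public suffix list.
PUBLIC_SUFFIXES = {"com", "uk", "co.uk", "org.uk", "de"}

def split_domain(host):
    """Return (public_suffix, second_level_domain, tld) for a host name."""
    labels = host.lower().split(".")
    # Candidates are tried longest-first, so the first hit is the
    # longest matching public suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            sld = labels[i - 1] if i > 0 else ""
            return "." + candidate, sld, "." + labels[-1]
    return "", "", ""

print(split_domain("www.bbc.co.uk"))  # ('.co.uk', 'bbc', '.uk')
```

For ‘bbc.co.uk’ this reproduces the example above: public suffix ‘.co.uk’, SLD ‘bbc’, TLD ‘.uk’.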

Page 1 of 6
• The second-level domain (SLD). This domain is directly below the public suffix and can be registered by an individual or organization. In our previous example ‘bbc.co.uk’ the SLD would be ‘bbc’.

• The internet media type or content type. It is used to specify the content in various internet protocols and consists of a type and a sub-type component. Examples are text/html, text/xml, application/pdf or image/jpeg.7

• The character encoding. The encoding describes how one or more bytes are mapped to characters of a character set, a collection of letters and symbols. If the encoding is unknown or an incorrect encoding is applied, a byte sequence cannot be restored to its original text. Examples of character sets are ‘ASCII ’ for English text, ‘ISO-8859-6 ’ for Arabic or ‘UTF-8 ’ for most world languages.

• The ARC file name. The ARC file name is the last component of the uniform resource identifier s3://aws-publicdatasets/common-crawl/parse-output/segment/[segment]/[ARC_file]. It allows us to link a given web document to the ARC file it is stored in. ARC file names are unique, so the segment name is not necessary for identification.

• The byte size. The byte size is the number of raw bytes of the document’s content. We sum this value over multiple documents of the same SLD and over the entire corpus to make assumptions about the data distribution.

Setup

Figure 1 shows the overall data flow of the experiment. The 2012 corpus8 consists of 210 terabytes of web data, which was processed to extract the fields listed above. This resulted in a 241 gigabyte summary of 3.83 billion documents corresponding to 41.4 million distinct second-level domains. The non-aggregated data is accessible at s3://aws-publicdatasets/common-crawl/index2012/.

For the experiment we made two major decisions. Instead of processing all ARC files at once, we split the 2012 corpus into manageable subsets of 25 thousand files, processed them individually and later combined the intermediate results. Furthermore, we chose the format of tab-separated values for the non-aggregated data – the long list of entries with public suffix, second-level domain, content type, encoding, file name and byte size – which would allow us to easily run SQL-like queries using Apache Hive9 later on.

The actual experiment took approximately 1500 instance hours, totalling about US$ 200 including the use of EC2 spot instances and data transfer from and to S3. Along with development and testing, we spent about three times this figure. This makes the Common Crawl corpus very accessible, especially to start-ups like SwiftKey. A summary of the experimental setup is shown in Figure 2.

Figure 2: Experiment setup (EMR Hadoop clusters of m1.xlarge spot instances, 35 × 6h per cluster = 1260 instance hours, each cluster processing a subset of 25 thousand ARC files; an EMR Hive cluster of m1.xlarge spot instances, 15h = 90 instance hours; yielding 241 gigabytes of extracted information).

Exploratory analysis

After extracting the public suffix, second-level domain, content type, encoding, ARC file name and byte size of 3.83 billion web documents, we wanted to answer general questions concerning the distribution of domains, media types and encodings, but also to understand more about the structure of the corpus and its representativeness.

7 See the Internet Assigned Numbers Authority for more details: http://www.iana.org/assignments/media-types.
8 The 2012 corpus can be found here: s3://aws-publicdatasets/common-crawl/parse-output/.
9 See http://hive.apache.org/ for more information.
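The per-SLD aggregation over the tab-separated summary can be sketched in plain Python as a stand-in for the Hive queries; the file name is hypothetical and the six-column order follows the field list above.

```python
import csv
from collections import defaultdict

def aggregate_by_sld(path):
    """Count documents and sum byte sizes per second-level domain.

    The TSV at `path` is assumed to hold one document per row with the
    columns: public suffix, SLD, content type, encoding, ARC file name,
    byte size -- mirroring the extracted fields described above.
    """
    docs = defaultdict(int)
    size = defaultdict(int)
    with open(path, newline="") as f:
        for suffix, sld, ctype, enc, arc, nbytes in csv.reader(f, delimiter="\t"):
            docs[sld] += 1
            size[sld] += int(nbytes)
    return docs, size
```

A Hive query over the same table would express this as a `GROUP BY sld` with `COUNT(*)` and `SUM(byte_size)`.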

TLD Abs. freq. Rel. freq.
.com 2,139,229,462 0.5587
.org 230,777,285 0.0603
.net 208,147,478 0.0544
.de 181,658,774 0.0474
.uk 132,414,696 0.0346
.pl 68,528,722 0.0179
.ru 65,147,873 0.0170
.nl 54,871,489 0.0143
.info 50,395,860 0.0132
.it 49,719,965 0.0130
.fr 49,648,844 0.0130
.jp 43,790,880 0.0114
others 554,450,743 0.1448

Figure 3: Top-level domain distribution based on document frequencies.

TLD   Rel. freq. W3 survey   Rel. freq. CC corpus   Ratio      TLD   Rel. freq. W3 survey   Rel. freq. CC corpus   Ratio
.gov 0.001 0.0026 2.6 .in 0.009 0.0021 0.2
.nz 0.001 0.0022 2.2 .tk 0.001 0.0002 0.2
.edu 0.003 0.0061 2.0 .th 0.001 0.0002 0.2
.uk 0.019 0.0346 1.8 .kz 0.001 0.0002 0.2
.nl 0.008 0.0143 1.8 .co 0.003 0.0004 0.1
.se 0.003 0.0053 1.8 .az 0.001 0.0001 0.1
.ca 0.004 0.0069 1.7 .asia 0.001 0.0001 0.1
.ch 0.003 0.0052 1.7 .pk 0.001 0.0001 0.1
.cz 0.005 0.0076 1.5 .ve 0.001 0.0001 0.1
.org 0.041 0.0603 1.5 .ir 0.006 0.0005 0.1

(a) Top 10 overrepresented TLDs. (b) Top 10 underrepresented TLDs.

Table 1: Representativeness of TLDs.

Top-level domains

One of the main questions was which TLDs had been crawled and what their share of the total corpus was. For this, we aggregated counts of public suffixes like .org.uk and .co.uk under .uk. Figure 3 summarizes these statistics by listing all TLDs above a relative frequency of 0.01, i.e. 1%. For the 2012 corpus, there are 12 TLDs above this threshold.

It becomes immediately apparent that more than half of the crawled documents are registered under the .com domain. This can be explained by the fact that this TLD contains sites from all over the world.

Comparing these figures to the general usage of TLDs for websites provided by the web technology survey10, it is possible to make assumptions about the representativeness of the CC corpus as a sample of the internet and whether it is biased towards certain TLDs. The Spearman rank correlation coefficient gave a value of ρ = 0.84, which indicates a good positive correlation for the top 75 TLDs.

To quantify the bias, we took the ratio of the relative frequency in the CC corpus to the expected value from the web technology survey. We labelled TLDs with ratios above 1 as overrepresented and those with ratios below 1 as underrepresented. The top 10 over- and underrepresented domains are listed in Tables 1a and 1b. Most of the overrepresented TLDs are domains of English-speaking or European countries. The underrepresented domains are mostly Asian and some are South American.

Second-level domains

In Tables 2a and 2b the top 10 second-level domains are given by document frequency and by data in terabytes. With 2.5% of all websites and 4.2% of the total data, youtube.com is the highest-ranking second-level domain in the CC corpus 2012.

10 Source: http://w3techs.com/technologies/overview/top_level_domain/all (March 2013).
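The two representativeness measures above, Spearman's rank correlation and the frequency ratio, can be reproduced in a few lines of Python. The sample below uses only four TLD frequencies taken from Tables 1a and 1b, so the resulting ρ differs from the 0.84 reported for the top 75 TLDs; ties in the ranking are ignored for brevity.

```python
def ranks(values):
    """Return the rank of each value (0 = smallest); ties not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Four TLDs from Tables 1a/1b: survey vs. CC-corpus relative frequency.
survey = {".uk": 0.019, ".nl": 0.008, ".in": 0.009, ".ir": 0.006}
corpus = {".uk": 0.0346, ".nl": 0.0143, ".in": 0.0021, ".ir": 0.0005}

tlds = sorted(survey)
rho = spearman([survey[t] for t in tlds], [corpus[t] for t in tlds])
# Ratio > 1: overrepresented in the CC corpus; < 1: underrepresented.
ratios = {t: corpus[t] / survey[t] for t in tlds}
```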

Rank SLD Abs. freq. Rel. freq. Rank SLD terabytes Rel. data
1 youtube.com 95,866,041 0.0250 1 youtube.com 8.7560 0.0417
2 blogspot.com 45,738,134 0.0119 2 wordpress.com 1.0128 0.0048
3 tumblr.com 30,135,714 0.0079 3 flickr.com 0.7500 0.0036
4 flickr.com 9,942,237 0.0026 4 hotels.com 0.2839 0.0014
5 amazon.com 6,470,283 0.0017 5 typepad.com 0.1693 0.0008
6 google.com 2,782,762 0.0007 6 federal-hotel.com 0.1617 0.0008
7 thefreedictionary.com 2,183,753 0.0006 7 shopzilla.com 0.1230 0.0006
8 tripod.com 1,874,452 0.0005 8 shopping.com 0.1210 0.0006
9 hotels.com 1,733,778 0.0005 9 yoox.com 0.1081 0.0005
10 flightaware.com 1,280,875 0.0003 10 tripadvisor.es 0.1074 0.0005

(a) .com domains by document frequency. (b) .com domains by data in terabytes.

Rank SLD Abs. freq. Rel. freq. Rank SLD terabytes Rel. data
1 citysite.net 1,194,938 0.0003 1 tripadvisor.es 0.1074 0.0005
2 yahoo.co.jp 1,022,024 0.0003 2 tripadvisor.in 0.1051 0.0005
3 amazon.de 864,516 0.0002 3 ca.gov 0.0857 0.0004
4 wrzuta.pl 827,315 0.0002 4 epa.gov 0.0803 0.0004
5 dancesportinfo.net 675,029 0.0002 5 iha.fr 0.0781 0.0004
6 atwiki.jp 665,594 0.0002 6 amazon.fr 0.0768 0.0004
7 weblio.jp 642,366 0.0002 7 who.int 0.0763 0.0004
8 blogg.se 611,502 0.0002 8 europa.eu 0.0590 0.0003
9 kijiji.ca 608,583 0.0002 9 autotrends.be 0.0569 0.0003
10 rakuten.co.jp 564,760 0.0001 10 astrogrid.org 0.0555 0.0003

(c) Non-.com domains by document frequency. (d) Non-.com domains by data in terabytes.

Table 2: Top 10 second-level domains (SLD).

Other high-ranking domains are blog publishing services like blogspot.com, wordpress.com or typepad.com, online shopping sites such as amazon.com, shopzilla.com or shopping.com, the online dictionary thefreedictionary.com, the search engine google.com, travel and booking sites like hotels.com and tripadvisor.es, and photo sharing sites such as flickr.com and tumblr.com. Once again, out of all top 10 domains by document frequency and byte size only one is not from the .com TLD; federal-hotel.com, however, is a French hotel booking site.

Tables 2c and 2d list the top 10 non-.com sites by document frequency and data. Although .jp is only ranked 12th by document frequency in the corpus, there are four Japanese sites in the top 10 list: yahoo.co.jp, the shopping site rakuten.co.jp, the wiki site atwiki.jp and the Japanese-English online dictionary weblio.jp.

Another interesting fact is that for the video-sharing website youtube.com almost all documents are HTML text, as summarized in Table 3. The same seems to be the case for social websites like facebook.com, twitter.com, myspace.com, pinterest.com and linkedin.com. In contrast to youtube.com, however, these social websites only account for a negligible portion of the corpus. This might be explained by the fact that activities on these sites are not part of the general web that is accessible by a web crawler.

SLD            Media type   Abs. freq.    Rel. freq. SLD   Rel. freq. corpus
youtube.com    text/html    95,864,655    1.0000           0.0250
twitter.com    text/html    588,472       0.9992           0.0002
myspace.com    text/html    276,498       0.9978           0.0001
pinterest.com  text/html    270,704       0.9762           0.0001
facebook.com   text/html    212,543       0.9992           0.0001
linkedin.com   text/html    160,558       0.9996           0.0000

Table 3: Video-sharing and social websites by media type.

Character encoding

Explicitly specifying the character encoding of a given document ensures that its text can be properly represented and further processed. Although utf-8, which contains character sets for all scripts and languages, is the dominant encoding on the internet, 43% of the crawled documents did not have their encoding specified. A detailed summary is given in Figure 4.

Table 4 lists a number of top-level domains of countries which mainly use non-Latin scripts. For websites under these TLDs the correct encoding information is crucial to avoid encoding errors. Nevertheless, Chinese (.cn), Japanese (.jp) and Urdu (.pk) sites have a much higher ratio of websites with unknown encoding than the average top-level domain.

Character encoding   Abs. freq.      Rel. freq.
utf-8 1,866,333,314 0.4874
unknown 1,647,477,248 0.4303
iso-8859-1 229,671,038 0.0600
windows-1251 26,798,707 0.0070
iso-8859-2 10,088,397 0.0026
iso-8859-15 8,605,343 0.0022
windows-1256 5,454,253 0.0014
shift-jis 5,289,261 0.0014
windows-1252 5,173,227 0.0014
euc-jp 4,201,400 0.0011
others 18,074,100 0.0047

Figure 4: Distribution of character encodings.

Media type                 Abs. freq.      Rel. freq.
text/html                  3,532,930,141   0.9227
application/pdf            92,710,175      0.0242
text/xml                   80,184,383      0.0209
text/css                   22,872,511      0.0060
application/x-javascript   21,198,040      0.0055
image/jpeg                 14,116,839      0.0037
application/javascript     11,548,630      0.0030
text/plain                 10,713,438      0.0028
application/msword         6,648,861       0.0017
application/xml            4,999,123       0.0013
application/rss+xml        4,200,583       0.0011

Figure 5: Distribution of media types.
